Extensive off-highway vehicle (OHV) activities can degrade fish habitat. Reclamation activities and reductions in public motorized access have been initiated in Alberta to support native trout recovery. In Alberta, three watersheds are targeted for remediation: Rocky Creek, Fall Creek, and Mackenzie Creek. The Alberta Ministry of Environment and Parks is monitoring fish populations in these watersheds to examine if the reclamation activities result in a measurable change to the density of immature and adult fish, particularly bull trout. The density of fish will be measured using electro-fishing methods as described in AEP (2018) and the catch per unit effort (CUE) using a standardized protocol will be the response variable.
The purpose of this document is to demonstrate how the collected data will be analyzed to examine if change has occurred in the mean CUE. Because restoration activities are ongoing, there is little/no data currently collected post-restoration. Consequently, simulated data will be used to demonstrate the analyses.
There are several levels of variation in the models in this project.
Stream-to-stream variation. Some streams are naturally more productive than other streams. In any model that combines results from several streams (e.g. the BACI models), a random effect for stream should be present in the statistical model. Fortunately, because the same stream is repeatedly measured over time, the stream serves as its own control and so the stream-to-stream variation has little impact on the precision of the results.
Year-to-year variation. CUE can vary across years at the same location due to year-specific effects (also known as process error) that affect all streams simultaneously. For example, a colder-than-normal spring may depress water temperatures during the sampling window which could have unspecified effects on the CUE. These year-specific effects tend to push all of the data for a year upwards or downwards, which violates a key assumption of independence in regular regression and ANOVA models as outlined in Schwarz (2018, 2017). The presence of year-specific effects is accounted for through a random effect of year in the models below.
Stream-year interaction variation. Ideally, while the mean level of the response in different stream would respond in parallel to year-specific effects, this is unlikely to occur. The stream-year interaction variation measures the non-parallelism in response of streams to year-specific effects. This variation is often the limiting factor in the power of a design to detect effects – refer to Schwarz (2018, 2017) for more details.
Site-to-Site variation within a stream. The CUE on a stream in a specific year will not be the same at each site measured. This lowest level of variation represents site-specific effects for a particular year, sampling artefacts, measuring error, etc. Multiple sites measured on a stream in a year are pseudo-replicates. Consequently, the values should either be averaged before the analysis, or a mixed model that incorporates stream-year random effects need to be fit. If the designs below are balanced, e.g. every stream measured at the same number of sites in each year, the two approaches are mathematically equivalent.
To account for these sources of variation, mixed linear models will be used with appropriate random effects as needed. See the individual analyses below for more details.
The \(\mathit{lmerTest}\) package (Kuznetsova et al., 2018) in the \(R\) software system (R Core Team, 2018) has been used. This package uses Restricted Maximum Likelihood (REML) to fit the models and computes the usual estimates, standard errors, and p-values for effects. Post-hoc multiple comparisons among effects are computed using the \(\mathit{emmeans}\) package (Lenth, 2019) which provides a unified way to estimate, for example, pair-wise comparisons such as difference in the mean between the before- and after- restoration periods, or BACI effects without relying on knowledge of \(R\)’s internal coding system.
It is always important to do a model assessment to ensure that the assumptions of the model are satisfied. In this analysis the key assumptions are
It is possible to extract the estimates of the residuals and random effects and create residual and caterpillar plots. A custom plotting function (sf.autoplot.lmer), suitable for mixed linear models and patterned after similar plots in the ggfortify package for linear models, is available at: [http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/schwarz.functions.r] and are utilized within this report. A copy of the function is included in the directory with this report.
A typical diagnostic plot for one of the models (a before/after study) is:
plot(sf.autoplot.lmer(ba.sim.fits[[1]]$fit))
The two plots in the top row examine the distribution of the residuals, the difference between the observed and fitted data. The upper left plot should show a random distribution around 0. An outlier will show itself as point(s) that are far from the rest of the points. The upper right plot is a normal probability plot. If the residuals follow a normal distribution, the points should lie close to the reference line. Non-normality and outliers will show themselves as point(s) far from the line, or a systematic departure from the reference line, especially in the tails.
Outliers should be identified because they have the potential to influence the estimates when the number of sites per year is small. When the number of sites in year becomes larger, the impact of an outlier is lessened.
The impact of non-normality is substantially less. In most cases, year-specific effects (process error) are the limiting factor that determines the precision of the estimates and the power to detect effects. Furthermore, because sites in a stream-year are pseudo-replicates, the analysis essentially uses the average logCUE over sites in a stream-year in the analysis. The Central Limit Theorem indicates that averages of any distribution tend towards normality as sample sizes increase, so the impacts of non-normality in the residuals are expected to be slight.
The remainder of the plots are caterpillar plots, one for each random effect (aside from the residuals) in the model. The blue dots are the estimated value of the specific value for a level of the random effect (in this case for a particular year) and the whiskers represent the uncertainty in this value. If the random effect variation is small, the blue dots should be almost vertical along the 0 (vertical) line.
The blue dots should show an approximate normal distribution around the 0 line (you need to imagine the density of the blue dots projected downwards). However, unless you have 20+ levels of the random effect, you are unlikely to be able to detect non-normality.
A single random effect value that stands out substantially from the rest may be a bit worrisome – it represents a level of the random factor (in this case a year) where the year-specific effect is an “outlier”. Outliers can lead to larger standard errors in the estimates and reduced power to detect effects in before/after studies because it says that the mean logCUE for a stream next year could be substantially different than the mean logCUE this year due to factors out of your control. Fortunately, in BACI designs, the impact of a large year-specific effect occurs on both the impact and control streams and so “cancels” out when the BACI comparison is done.
The question arises as to whether such outlier random effects should be removed, in this case, by dropping a year of data? There is no simple answer to this – but if the year-specific effect is caused by a 1-in-a-thousand-year-flood, there are good reasons to drop this year.
Similarly, random effects for streams (not shown) would be interpreted in the same way. Again, because all streams are measured before and after the restoration action, stream-effects again “cancel” in the BACI analysis (each stream serves as its own block) and so “outliers” in the caterpillar plots for stream-effects are of no consequence.
Finally, random effects for year-stream interactions in the BACI model (not shown) represent non-parallelism over time in the response of a stream to year-specific effects. Here it is important to scrutinize outliers with care because the year-stream interaction variation drives the precision and power of the design.
In many cases, the simplest approach is to run the analysis with the data in question included and excluded and see if the final conclusion changes.
We simulated six datasets in a before/after scenario – similar findings are expected for the other models in this documents. There were 8 years in the study, with half in the before restoration period and half in the after restoration period. In each year, we simulated either 4 or 10 sites per year. Three scenarios were considered - no violations of assumptions; a single (large) outlier; and a non-normal distribution for the residuals that is symmetric but has a much wider spread then a normal distribution.
Plots of the simulated data are:
The outlier in the first year is evident in the two lower plots; the impact of non-normality is not obvious in the middle two plots and could look like (small) outliers as seen in the plots. With only a small number of sites measured in a stream-year, it is unlikely that you will detect non-normality based on a plot of the data.
We fit the before/after model (see below) to the simulated data and extract the residual plots (the two top most plots in the diagnostic plot from the previous section).
The presence of the outlier is evident in either of the plots of residuals vs. fitted values or the normal probability plots regardless of sample size. The non-normality shows itself only in the normal probability plots and is more evident with 10 sites per year.
The estimated effect sizes were computed for each simulated dataset:
## Assumptions n.sites.per.year contrast estimate SE df p.value
## 1 No violations 4 After - Before 1.40 0.34 6 0.01
## 2 No violations 10 After - Before 1.22 0.33 6 0.01
## 3 Non normal site-effects 4 After - Before 1.27 0.40 6 0.02
## 4 Non normal site-effects 10 After - Before 1.14 0.37 6 0.02
## 5 Outlier 4 After - Before 1.10 0.68 6 0.16
## 6 Outlier 10 After - Before 1.20 0.42 6 0.03
The effect of the outlier is to cause a negative bias in the estimate because the outlier occurred in the before period and so increased the estimates of the mean in the before period relative to the after period. The bias decreases with increasing sample size per year as the rest of the data swamps this outlier.
The impact of non-normality is slight when the sample size is larger. When the sample size per stream-year is small, the non-normality of a single site can look like an outlier with similar effects.
As noted earlier, the impact of a large year-specific effect in a before/after study can be substantial, but the impact is less in BACI designs because each stream is measured in each year and so stream- and year-random effects cancel in the analysis.
We will again illustrate the impacts using simulated data - in this case a BACI design with 2 streams in each of the impact and control groups and 4 sites measured per stream-year.
A plot of the simulated data:
shows four scenarios where an outlier exists on one of the random components.
The diagnostic plots from the four scenarios appear below:
The outlier is apparent in all the plots except for that of stream. With only a few streams it is difficult to detect a large stream-effect, but as noted previously, it does not affect the BACI analysis.
Estimates of BACI effect sizes can be computed under each scenario:
## Assumptions contrast estimate SE df p.value
## 1 No violations baci -0.01 0.15 20 0.94
## 2 Outlier - Stream Effect baci -0.01 0.15 20 0.94
## 3 Outlier - Year Effect baci -0.01 0.15 20 0.94
## 4 Outlier - Year-Stream Effect baci 0.99 0.93 20 0.30
Only the outlier in the stream-year random effects has any influence on the estimates or the p-values for the reasons explained earlier.
Sampling at the lowest level, i.e., at the site, can be done using a new set of sites for each stream-year combination (consistent with previous sampling protocols) or by revisiting the same sites over time (proposed changes to the sampling protocols), or a mixture of the two.
The analyses presented below implicitly assume that sites change from year-to-year within a stream. If some sites are revisited, this approach is slightly inefficient (i.e. reported standard errors and reported p-values will be slightly too large) because revisiting sites over time can account for some of the site-to-site variation in measuring trends. However, in many cases, the limiting factors to detecting trends are year-specific effects (process error) and stream-year interactions and so the impact is expected to be slight.
Modifications to the statistical models to account for some sites being repeated measured over time would involve the addition of a new term representing the specific sites visited (e.g. \((1|Site)\)) where the \(\mathit{Site}\) variable is an alphanumeric code for the individual sites that remains consistent over years (e.g. based on latitude-longitude of the site location). This addition to the model is appropriate if all sites are repeated measured over time, or if only some of the sites are repeatedly measured over time. If none of the sites have been measured over time, then this term is confounded with the residual variation and can be deleted (as shown in the models below).
The tradition BACI analysis with multiple years before and/or after the restoration occurs, assumes that all sites are measured in each year. The years do not have to be consecutive; for example, sampling could occur in 2015, 2017, and 2018 before restoration, and the BACI analyses presented below would be unchanged.
The BACI analysis below also allows for missing sampling in some sites in some years. For example, suppose restoration occurs between 2018 and 2019. Control site 1 could be measured in 2015, 2017, 2018, 2019, 2020 and 2021; control site 2 could be measured in 2015, 2018, 2019, 2020; and the restored sited measured in 2015, 2018, 2019, 2021 assuming the restoration occurred between 2018 and 2019.
However, a design where control sites are measured in 2015, 2017, 2019, and 2021 while restoration sites are measured in 2016, 2018, 2020, and 2022 with restoration taking place between 2018 and 2019 will cause problems in the analysis. Technically, the design is not connected. In unconnected designs, there is no contrast among the observations that leads to estimates of the BACI effect that are not confounded with year- and/or site-effects.
For example, consider the first set of missing values above where restoration takes place between 2018 and 2019 and an X indicates that data are collected in that stream-year.
| Site | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|
| Control site 1 | x | x | x | x | x | x | |
| Control site 2 | x | x | x | x | |||
| Restoration site | x | x | x | x |
This design is connected because there exists at least one BACI contrast that is free of year- and site-effects. For example, the contrast \[(\mu_{cs1,2015}-\mu_{cs1,2018})-(\mu_{rs,2015}-\mu_{rs,2018})\] has an “expectation” of \[\mu_{CB}+CS1+Y2015 -\mu_{CA}-CS1-Y2018-\mu_{IB}-RS-Y2105 + \mu_{IA}+RS+Y2018=\mu_{CB} -\mu_{CA}-\mu_{IB}-+ \mu_{IA}\] which is the BACI contrast of interest, and where \(\mu_{CB}\), \(\mu_{CA}\), \(\mu_{IB}\), \(\mu_{IA}\) are the means of the control sites before restoration, control sites after restoration, restoration sites before restoration, and restoration site after restoration; \(Y2015\) (and similar) are the year-specific effects of 2015 and other years; and \(CS1\) \(CS2\), and \(RS\) are the site specific effects.
But now consider the second set of missing values above where there are no common years in the control or restoration site and restoration takes place between 2018 and 2019.
| Site | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|
| Control site 1 | x | x | x | x | |||
| Control site 2 | x | x | x | x | |||
| Restoration site | x | x | x |
Now there is no contrast whose “expectation” is the BACI contrast that is free of site- and year-effects. Because no years are measured in common between the control and restoration sites, it is not possible to obtain contrast that is free of year-effects; it is possible to obtain a contrast that is free of site-effects because both control and restoration sites are measured in at least on year before and after restoration occurs.
Note that I have not used a strict definition of “expectation” because if you are willing to believe that years and site-effects are random effect with mean 0, even the design with alternating year is “connected” because the actual expectation of the year-specific effect and site-specific effect is zero. But I would hesitate to use it as the actual BACI contrast will have a much large uncertainty because it includes additional variation from the random effects that “don’t cancel”.
In designs with missing data, all of the data are used. For example, if only one site is measured in one year (and hence is not suitable for a BACI contrast), it still provides information about the year-to-year standard deviation. This occurs automatically in the (restricted) maximum likelihood estimation procedure.
So, it is not “critical” that all sites be measured in all years as long as the resulting design is connected. This will be true if there is at least one year before and one year after where the control and restoration sites are measured concurrently, but there are other combinations which are also connected. Please consult me if you have a design with many “holes”.
While traditional mixed linear models provide p-values and estimates of effect size with measures of uncertainty (standard errors and confidence intervals), the results are often difficult to interpret.
In the traditional Neyman-Pearson system, there either is evidence (\(\textit{p-value}<.05\)) or no evidence (\(\textit{p-value}>.05\)) of an effect. While confidence intervals provide some information on the size of the effect, they are also difficult to deal with in a intuitive fashion. For example, if a confidence interval for an effect size ranges from \((-1 \rightarrow 1)\), it covers the value of 0 (indicating no evidence of an effect) but such a result is qualitatively different from another interval \((-0.1 \rightarrow 1.9)\) that “barely” contains the value of 0.
A Bayesian posterior belief of an effect has a very intuitive interpretation as the weight of evidence for a hypothesis from the data, unlike a p-value which can only be interpreted as the probability of seeing this data if there was no effect. So a posterior belief of 0.95 indicates very strong evidence of an effect (given the prior and the data collected), while a posterior belief of 0.30 indicates very weak evidence of an effect (given the prior and the data collected).
As noted above the posterior belief depends both on the data and the the prior distribution for the effect size. If you have a very strong prior belief in the effect, then it will take much data to move your posterior belief from the prior. If you have very little prior knowledge of the effect size, then a vague (or non-informative) prior is used, the data tends to dominate the prior, and the posterior belief is based mostly on the data collected.
The posterior distribution is a complex function of the data (the likelihood) and the prior distribution (prior belief) and is usually difficult to obtain. However, and Hill (2006, p. 142-143) show that the estimates from a mixed model fit are also the modes of the posterior distribution for that parameter with suitably chosen, usually non-informative priors. Gelman and Su (2019) implement these ideas in the \(\mathit{arm}\) package so that a sample from the posterior distribution can be generated after a mixed model fit without having to use specialized software, but at the price that it is difficult to modify the priors. [With enough data, the priors will have limited impact on the results]. Then it is relatively easy to compute, for example, the Bayesian posterior belief that the difference in means between before and after restoration is positive which has a natural interpretation. Gellman and Hill (2006, p. 347) discuss the advantages and disadvantages of non-informative priors.
There are various reasons to specify priors, but the usual reason is that with very sparse data, the actual data is not very informative about effects and it is often difficult to fit complex models with multiple random effects. In this case, the prior can helping to stabilize computations. Additionally, priors can incorporate important information into an analysis that does not enter through the data, such as prior knowledge from many other restoration projects about the likely magnitude and direction of the effect of restoration or the size of the year-specific effects. The danger of strong priors is that the results are essentially independent of the data! With sparse data, there is no easy way to proceed but this is a natural consequence of having very little information in the sparse data.
The \(bayesglm()\) function from the \(arm\) package, or the \(stan\_lmer()\) function from the \(rstanarm\) package (a successor to the \(arm\) package) provided additional flexibility where you can specify the prior distribution for the regression coefficients as well as any unknown covariance parameters. It seems unlikely that these will be needed in this context.
Data are used to illustrate the analyses below using a combination of real and simulated data. All of the current data are treated as “Before” restoration data and streams are divided into their control/impact (impact = restoration activities) status.
The simulated data starts in 2018 and ends in 2023. The simulated 2018 year is treated as an additional “Before” year so that all streams have at least two years of “Before” data (this is needed to fit the broken-stick model below). [When the actual data are available for the full experiment, the start of the restoration period can be set separately for each watershed.]
The number of sites simulated in each stream-year is a random sample taken from the set of number of sites measured in each steam-year in the current data with a minimum of two sites measured per stream-year. The first four (or the actual number if less than 4 sites are simulated) are treated as “permanent” sites that are repeatedly measured over time. All other sites are assumed to be new sites in each stream-year.
The existing data was used to estimate the variance components associated with streams, years, and stream-year interactions. These variance components were used to generate random normal deviations on the \(log(\mathit{CUE})\) representing the effects of streams, year, and stream-years.
The current data has virtually NO sites that were repeatedly measured over time, so it is not possible to estimate the site variance component. Consequently, no site-effects were generated so it will not be surprising that models with Site effects will have estimated site variance components of zero. [You can still fit the models with permanent sites – the results would then be essentially the same as the model without site-effects.]
A simulated step effect of restoration was added to the simulated data above for the restoration streams but only for one species. The size of the step effect (\(\mu_\textit{after}-\mu_\textit{before})\) on the \(\log()\) scale is:
## Stream BKTR BLTR RNTR
## 1 Fall 2.0 0 0
## 2 Mackenzie 1.5 0 0
## 3 Rocky 2.5 0 0
The control sites were assumed to have no change in slope between the before and after periods for all species. A revised slope (\(\beta_\textit{after}-\beta_\textit{before}\)) was used to modify the CUE (on the log-scale) as shown below:
## Stream BKTR BLTR RNTR
## 1 Fall 0 0.40 0
## 2 Mackenzie 0 0.45 0
## 3 Rocky 0 0.35 0
The new slope apply to all age classes (if there are more than one). The simulated data was written to a .csv file for usage in this document.
Data will be available from the FWMIS system and it is assumed that it will have been downloaded into a suitable format, e.g. an Excel workbook or a csv file. We currently use the simulated data as described above.
One of the key issues with many database systems is accounting for sites that are visited, but NO fish, of a particular/all species were detected. In many cases, database systems only record positive counts. A naive extraction of the data from such systems would not capture the 0 counts for species looked for but not seen.
This could lead to biases in estimates. For example, consider the very simple example where:
| Site | Species | Count |
|---|---|---|
| A | RNTR | 3 |
| A | BKTR | 2 |
| B | BKTR | 2 |
| B | BLTR | 5 |
| C | RNTR | 5 |
| C | BKTR | 2 |
An incorrect estimate of mean density over the three sites for RNTR would be \((3+5)/2\) rather than the correct estimate of \((3+0+5)/3\).
To account for species with 0 counts, the database system must have two separate tables. In the first table, is a record of all visits made to all sites in each year regardless if any fish are captured or not. This table should contain information that is specific to the visit, such as the length of the segment, the personnel used, etc.
The second table would contain a count of the detected fish species for each visit. Notice that only records for species with positive counts will be included. Consequently, it will be necessary to impute counts of 0 for sites in which a visit was made but this species was not detected.
By merging the two tables, records with 0 counts for species not detected can be inserted. This document will assume that this zero imputation has been done, so the input file has a record of the count (including 0) for each species of interest at each year-site-visit combination. The CUE is then computed as the \(count / distance~sampled * 300\) to standardize per 300 m of effort for each visit.
The current FWMIS system has been set up as a relational database along these lines. One table has higher level information about a site (distance shocked, location, etc.) and a second table has dependent information on species captured (fork length, weight, sex, etc.). The Oracle query in FWMIS is now designed to provide information on sites with zero fish encountered to account for this potential problem.
Each combination of stream and year also needs to be labelled with two variables. The variable \(\mathit{BA}\) indicates if the measurement takes place before or after the restoration takes place in the watershed (e.g. using codes \(Before\) and \(After\)); the variable \(\mathit{CI}\) indicates if the stream is a control (\(C\)) or restored (treated, \(T\)).
Of course, other codes can be used and the code below may have to be modified. The modifications should be obvious in most cases. Some care will be needed in the sections of code where the Bayesian posterior beliefs are extracted as the variable name used to extract the posterior samples depends on the internal \(R\) coding of factor levels. A simple listing of the few records from the data frame generated by the arm::sim() function will show the variable names generated by \(R\).
As explained in previous reports, the response variable is the \(log(\mathit{CUE}+\textit{offset})\) where \(log()\) refers to the natural logarithmic transform and the \(\textit{offset}\) is 1/2 of the smallest non-zero value. A separate offset will be used for each combination of species, watershed, and age-class.
We read in the (simulated) CUE data that includes the before and after data and compute the \(log(CUE+offset)\).
cue.data <- read.csv("simulated.data.csv", header=TRUE, as.is=TRUE, strip.white=TRUE)
head(cue.data)
## Species AgeClass Watershed Stream Year CI BA Site CUE
## 1 BKTR AllAges Clearwater Cutoff 2023 C After ST.2023.8 1.86454039
## 2 BKTR AllAges Clearwater Cutoff 2023 C After ST.2023.5 1.39840529
## 3 BKTR AllAges Clearwater Cutoff 2023 C After SP1 1.83124502
## 4 BKTR AllAges Clearwater Cutoff 2022 C After SP3 1.53583711
## 5 BKTR AllAges Clearwater Cutoff 2023 C After ST.2023.6 0.07768918
## 6 BKTR AllAges Clearwater Cutoff 2023 C After ST.2023.12 2.54154612
tail(cue.data)
## Species AgeClass Watershed Stream Year CI BA Site CUE
## 1572 BLTR Immature Ath/NSask Thistle 2017 C Before -116.891999137736--52.796556752878 48.5313348
## 1573 BLTR Mature Ath/NSask Thistle 2018 C Before ST.2018.7 9.1169547
## 1574 BLTR Mature Ath/NSask Thistle 2018 C Before ST.2018.9 4.2091050
## 1575 BLTR Mature Ath/NSask Thistle 2018 C Before ST.2018.5 1.1063458
## 1576 BLTR Mature Ath/NSask Thistle 2018 C Before SP1 0.6238792
## 1577 BLTR Mature Ath/NSask Thistle 2018 C Before ST.2018.8 0.6405160
# compute offset for each watershed, species, age-class
offset <- plyr::ddply(cue.data, c("Watershed","Species","AgeClass"), plyr::summarize,
offset = .5* min(CUE[ CUE>0]))
offset
## Watershed Species AgeClass offset
## 1 Ath/NSask BKTR AllAges 0.006317553
## 2 Ath/NSask BLTR Immature 0.012742995
## 3 Ath/NSask BLTR Mature 0.007735561
## 4 Ath/NSask RNTR Immature 0.006292327
## 5 Ath/NSask RNTR Mature 0.004040876
## 6 Clearwater BKTR AllAges 0.008368767
## 7 Clearwater BLTR Immature 0.006419222
## 8 Clearwater BLTR Mature 0.006419222
## 9 Ram BKTR AllAges 0.005453882
## 10 Ram BLTR Immature 0.175515665
## 11 Ram BLTR Mature 0.021656866
cue.data <- merge(cue.data, offset)
cue.data$logCUE <- log(cue.data$CUE + cue.data$offset)
I’m a paranoid guy, so I do extensive checks before doing any analysis!
We first enumerate how many sites are measured on each stream in each year.
stream.year.sites <- unique(cue.data[,c("Watershed","Stream","Year","Site")] )
xtabs(~paste(Watershed,',',Stream,sep="")+Year, data=stream.year.sites)
## Year
## paste(Watershed, ",", Stream, sep = "") 2008 2010 2011 2012 2014 2015 2017 2018 2019 2020 2021 2022 2023
## Ath/NSask,Mackenzie 0 0 21 0 0 0 16 16 16 12 9 16 3
## Ath/NSask,Moon 5 0 0 0 1 1 4 16 2 8 2 4 2
## Ath/NSask,Thistle 0 0 0 0 0 0 8 9 8 3 2 2 2
## Clearwater,Cutoff 0 0 4 0 0 0 3 2 2 4 2 8 12
## Clearwater,Elk 0 17 5 0 0 0 2 8 2 2 2 2 2
## Clearwater,Limestone 0 0 5 0 0 0 1 4 2 4 4 5 4
## Clearwater,Rocky 0 12 0 5 0 0 1 9 16 5 16 17 5
## Ram,Fall 9 0 0 0 0 0 2 2 2 9 2 3 12
The actual number that are measured for each stream-year is not that important and the analyses below do not depend on the design being balanced, i.e. different number of sites can be measured for each stream-year ,and in some years, no sites could be measured.
We check that the \(\mathit{BA}\) variable has been set for each stream-year combination. First, any one year must be classified into one and only one period.
stream.year <- unique(cue.data[,c("Watershed","Stream","Year","BA")] )
xtabs(~paste(Watershed,',',Stream,sep="")+Year, data=stream.year)
## Year
## paste(Watershed, ",", Stream, sep = "") 2008 2010 2011 2012 2014 2015 2017 2018 2019 2020 2021 2022 2023
## Ath/NSask,Mackenzie 0 0 1 0 0 0 1 1 1 1 1 1 1
## Ath/NSask,Moon 1 0 0 0 1 1 1 1 1 1 1 1 1
## Ath/NSask,Thistle 0 0 0 0 0 0 1 1 1 1 1 1 1
## Clearwater,Cutoff 0 0 1 0 0 0 1 1 1 1 1 1 1
## Clearwater,Elk 0 1 1 0 0 0 1 1 1 1 1 1 1
## Clearwater,Limestone 0 0 1 0 0 0 1 1 1 1 1 1 1
## Clearwater,Rocky 0 1 0 1 0 0 1 1 1 1 1 1 1
## Ram,Fall 1 0 0 0 0 0 1 1 1 1 1 1 1
We check that the above table has only 1 and 0’s for any entry. Any entry that is >1 indicates a year that is classified in more than one period for particular stream, which is not possible.
Similarly, we look at the actual classification of each stream-year combination:
# R allows matrices to be index by the row/column names. We set up a matrix with the appropriate names
# and then assign the values based on the combination of row names and column names seen in the data.
# See the help files for '[' for more information
stream.year <- unique(cue.data[,c("Watershed","Stream","Year","BA")] )
BA.table <- matrix(" ", nrow=length(unique(cue.data$Stream)), ncol=length(unique(cue.data$Year)))
colnames(BA.table) <- as.character(sort(unique(cue.data$Year)))
rownames(BA.table) <- sort(unique(paste(cue.data$Watershed,'.',cue.data$Stream,sep="")))
BA.table[ cbind( paste(stream.year$Watershed,".",stream.year$Stream,sep=""), as.character(stream.year$Year))] <- substr(stream.year$BA,1,1)
BA.table
## 2008 2010 2011 2012 2014 2015 2017 2018 2019 2020 2021 2022 2023
## Ath/NSask.Mackenzie " " " " "B" " " " " " " "B" "B" "A" "A" "A" "A" "A"
## Ath/NSask.Moon "B" " " " " " " "B" "B" "B" "B" "A" "A" "A" "A" "A"
## Ath/NSask.Thistle " " " " " " " " " " " " "B" "B" "A" "A" "A" "A" "A"
## Clearwater.Cutoff " " " " "B" " " " " " " "B" "B" "A" "A" "A" "A" "A"
## Clearwater.Elk " " "B" "B" " " " " " " "B" "B" "A" "A" "A" "A" "A"
## Clearwater.Limestone " " " " "B" " " " " " " "B" "B" "A" "A" "A" "A" "A"
## Clearwater.Rocky " " "B" " " "B" " " " " "B" "B" "A" "A" "A" "A" "A"
## Ram.Fall "B" " " " " " " " " " " "B" "B" "A" "A" "A" "A" "A"
We check that for all years, the correct designation of before and after has been done.
Similarly, we check that the \(\mathit{CI}\) variable has been set for each stream-year combination. First, any one stream-year must be classified into either the treatment or control groups.
stream.year <- unique(cue.data[,c("Watershed","Stream","Year","CI")] )
xtabs(~paste(Watershed,',',Stream,sep="")+Year, data=stream.year)
## Year
## paste(Watershed, ",", Stream, sep = "") 2008 2010 2011 2012 2014 2015 2017 2018 2019 2020 2021 2022 2023
## Ath/NSask,Mackenzie 0 0 1 0 0 0 1 1 1 1 1 1 1
## Ath/NSask,Moon 1 0 0 0 1 1 1 1 1 1 1 1 1
## Ath/NSask,Thistle 0 0 0 0 0 0 1 1 1 1 1 1 1
## Clearwater,Cutoff 0 0 1 0 0 0 1 1 1 1 1 1 1
## Clearwater,Elk 0 1 1 0 0 0 1 1 1 1 1 1 1
## Clearwater,Limestone 0 0 1 0 0 0 1 1 1 1 1 1 1
## Clearwater,Rocky 0 1 0 1 0 0 1 1 1 1 1 1 1
## Ram,Fall 1 0 0 0 0 0 1 1 1 1 1 1 1
We check that the above table has only 1 and 0’s for any entry. Any entry that is >1 indicates a year that is classified in more than one of treatment or control for particular stream, which is not possible.
Similarly, we look at the actual classification of each stream-year combination:
stream.year <- unique(cue.data[,c("Watershed","Stream","Year","CI")] )
CI.table <- matrix(" ", nrow=length(unique(cue.data$Stream)), ncol=length(unique(cue.data$Year)))
colnames(CI.table) <- as.character(sort(unique(cue.data$Year)))
rownames(CI.table) <- sort(unique(paste(cue.data$Watershed,'.',cue.data$Stream,sep="")))
CI.table[ cbind( paste(stream.year$Watershed,".",stream.year$Stream,sep=""), as.character(stream.year$Year))] <- substr(stream.year$CI,1,1)
CI.table
## 2008 2010 2011 2012 2014 2015 2017 2018 2019 2020 2021 2022 2023
## Ath/NSask.Mackenzie " " " " "T" " " " " " " "T" "T" "T" "T" "T" "T" "T"
## Ath/NSask.Moon "C" " " " " " " "C" "C" "C" "C" "C" "C" "C" "C" "C"
## Ath/NSask.Thistle " " " " " " " " " " " " "C" "C" "C" "C" "C" "C" "C"
## Clearwater.Cutoff " " " " "C" " " " " " " "C" "C" "C" "C" "C" "C" "C"
## Clearwater.Elk " " "C" "C" " " " " " " "C" "C" "C" "C" "C" "C" "C"
## Clearwater.Limestone " " " " "C" " " " " " " "C" "C" "C" "C" "C" "C" "C"
## Clearwater.Rocky " " "T" " " "T" " " " " "T" "T" "T" "T" "T" "T" "T"
## Ram.Fall "T" " " " " " " " " " " "T" "T" "T" "T" "T" "T" "T"
We check that for all years, the correct designation of control or impact has been done.
We check for repeated sites.
# Check for repeatedly measured sites over time
sites <- unique(cue.data[,c("Watershed","Stream","Site","Year")])
n.visits.per.site <- plyr::ddply(sites, c("Watershed","Stream","Site"), plyr::summarize,
n.visits=length(Year))
cat("Sites with more than one visit over time\n")
## Sites with more than one visit over time
rep.sites <- n.visits.per.site[ n.visits.per.site$n.visits >1,]
head(rep.sites)
## Watershed Stream Site n.visits
## 38 Ath/NSask Mackenzie SP1 6
## 39 Ath/NSask Mackenzie SP2 6
## 40 Ath/NSask Mackenzie SP3 6
## 41 Ath/NSask Mackenzie SP4 5
## 100 Ath/NSask Moon -118.521635--53.628476 2
## 101 Ath/NSask Moon SP1 6
sites.red <- merge(sites, rep.sites, all.y=TRUE) # only selected the repeated sites.
xtabs(~ paste(Watershed,'.',Stream,'.',Site,sep="")+Year, data=sites.red, exclude=NULL, na.action=na.pass)
## Year
## paste(Watershed, ".", Stream, ".", Site, sep = "") 2008 2017 2018 2019 2020 2021 2022 2023
## Ath/NSask.Mackenzie.SP1 0 0 1 1 1 1 1 1
## Ath/NSask.Mackenzie.SP2 0 0 1 1 1 1 1 1
## Ath/NSask.Mackenzie.SP3 0 0 1 1 1 1 1 1
## Ath/NSask.Mackenzie.SP4 0 0 1 1 1 1 1 0
## Ath/NSask.Moon.-118.521635--53.628476 1 1 0 0 0 0 0 0
## Ath/NSask.Moon.SP1 0 0 1 1 1 1 1 1
## Ath/NSask.Moon.SP2 0 0 1 1 1 1 1 1
## Ath/NSask.Moon.SP3 0 0 1 0 1 0 1 0
## Ath/NSask.Moon.SP4 0 0 1 0 1 0 1 0
## Ath/NSask.Thistle.SP1 0 0 1 1 1 1 1 1
## Ath/NSask.Thistle.SP2 0 0 1 1 1 1 1 1
## Ath/NSask.Thistle.SP3 0 0 1 1 1 0 0 0
## Ath/NSask.Thistle.SP4 0 0 1 1 0 0 0 0
## Clearwater.Cutoff.SP1 0 0 1 1 1 1 1 1
## Clearwater.Cutoff.SP2 0 0 1 1 1 1 1 1
## Clearwater.Cutoff.SP3 0 0 0 0 1 0 1 1
## Clearwater.Cutoff.SP4 0 0 0 0 1 0 1 1
## Clearwater.Elk.SP1 0 0 1 1 1 1 1 1
## Clearwater.Elk.SP2 0 0 1 1 1 1 1 1
## Clearwater.Limestone.SP1 0 0 1 1 1 1 1 1
## Clearwater.Limestone.SP2 0 0 1 1 1 1 1 1
## Clearwater.Limestone.SP3 0 0 1 0 1 1 1 1
## Clearwater.Limestone.SP4 0 0 1 0 1 1 1 1
## Clearwater.Rocky.SP1 0 0 1 1 1 1 1 1
## Clearwater.Rocky.SP2 0 0 1 1 1 1 1 1
## Clearwater.Rocky.SP3 0 0 1 1 1 1 1 1
## Clearwater.Rocky.SP4 0 0 1 1 1 1 1 1
## Ram.Fall.SP1 0 0 1 1 1 1 1 1
## Ram.Fall.SP2 0 0 1 1 1 1 1 1
## Ram.Fall.SP3 0 0 0 0 1 0 1 1
## Ram.Fall.SP4 0 0 0 0 1 0 0 1
We see that there were essentially no repeated sites in the real data (2017 and before), but up to 4 repeated sites per stream in the simulated data (sites labelled SP1, SP2, SP3, or SP4). Not all repeated sites are measured in each year.
The simplest analysis is a before-after analysis where the mean CUE is compared between the before restoration period and the after restoration period. This assumes that restoration results in a step change in restoration, rather than a gradual trend. No control river is used, so an effect may be confounded with temporal changes between the two periods.
A preliminary plot is constructed to check for outliers etc.
# Compute the mean logCUE for each year
mean.CUE <- plyr::ddply(cue.data, c("Watershed","Stream","Species","AgeClass","Year","BA"),
plyr::summarize,
mean.logCUE=mean(logCUE, na.rm=TRUE))
# When did restoration take place for each watershed/species
year.restore <- plyr::ddply(mean.CUE, c("Watershed","Stream","Species","AgeClass"), plyr::summarize,
year.restore = max( Year[ BA == "Before"]))
# A separate plot is produced for each species
plyr::l_ply(sort(unique(cue.data$Species)), function(x,cue.data, mean.CUE){
sel.cue.data <- cue.data[ cue.data$Species==x,]
sel.mean.CUE <- mean.CUE[ mean.CUE$Species==x,]
plot.cue<- ggplot(data=sel.cue.data, aes(x=Year, y=logCUE, color=AgeClass))+
ggtitle(paste("Preliminary plot of the data for ",x))+
geom_point( position=position_jitter(h=.1))+
geom_line(data=sel.mean.CUE, aes(y=mean.logCUE), position=position_jitter(h=.1))+
facet_wrap(~paste(Watershed, '.', Stream), ncol=3)+
theme(legend.justification=c(1,0), legend.position=c(1,0))
plot(plot.cue)
}, cue.data=cue.data, mean.CUE=mean.CUE)
If outliers are detected, they should be dealt with.
The statistical model must account for the multiple measurements in a site-year (the multiple sites). This can be done in two ways
If the design were exactly balanced, i.e. the same number of measurements (sites) in each year, the two approaches would give exactly the same results.
We will use the mixed-linear model approach for every combination of species-ageclass-stream. The model is \[log(\mathit{CUE}+\textit{offset}) \sim \mathit{BA} + \mathit{YearF}(R)\] where \(\mathit{BA}\) is the before/after classification for each year, and \(\mathit{YearF}\) is a random effect of years (it must be declared as a factor to avoid being confused with a linear trend). A multiple comparison procedure will then estimate the size of the effect on both the log-scale (difference in mean logarithms) and the original scale (ratio of means) between the before and after periods.
The \(\mathit{lmerTest}\) package is used to fit the model; the \(\mathit{emmeans}\) package is used to estimate the period effect after the model is fit.
# Get all of the fits and store them in a list with the emmeans computatons
ba.fits <- plyr::dlply(cue.data, c("Watershed","Stream","Species","AgeClass"), function(x){
#browser()
x$YearF <- factor(x$Year)
# If we use the log() in the fit, the emmeans package can do the back transform automatically for us
fit <- lmerTest::lmer(log(CUE+offset) ~ BA + (1|YearF), data=x)
# we create the emmeans object and then find the pairwise difference
fit.emmo <- emmeans::emmeans(fit, ~BA)
BA.diff.log <- summary(pairs(fit.emmo), infer=TRUE)
BA.ratio <- summary(pairs(fit.emmo, type="response"))
list(fit=fit,
BA.diff.log=BA.diff.log,
BA.ratio=BA.ratio)
})
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
# Show the results for the first fit
ba.fits[[1]]$fit
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: log(CUE + offset) ~ BA + (1 | YearF)
## Data: x
## REML criterion at convergence: 368.7576
## Random effects:
## Groups Name Std.Dev.
## YearF (Intercept) 0.3024
## Residual 1.2868
## Number of obs: 109, groups: YearF, 8
## Fixed Effects:
## (Intercept) BABefore
## -1.244 -1.445
ba.fits[[1]]$BA.diff.log
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## After - Before 1.45 0.338 5.03 0.577 2.31 4.271 0.0078
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
ba.fits[[1]]$BA.ratio
## contrast ratio SE df t.ratio p.value
## After / Before 4.24 1.44 5.03 4.271 0.0078
##
## Tests are performed on the log scale
A message about a singular fit indicates that the estimated standard deviation of the year random effect for some of the streams is 0. The estimated difference/ratio between the after and before periods, along with standard errors, a 95% confidence limit, and p-value of the test for no effect is produced.
The fitted values and observed data can be plotted together. The blue line below is the fitted before/after effect and the red line joins the mean of the log(CUE+offset).
# Get the fitted values
ba.pred <- plyr::ldply(ba.fits, function(x){
#browser()
# need to convert the YearF from factor back to numeric value
pred <- cbind( x$fit@frame, pred=predict(x$fit, re.form=~0))
pred$Year <- as.numeric(as.character(pred$YearF))
pred
})
# Now to generate the observed and fitted data for each Watershed/Stream combination
npages=nrow(unique(cue.data[,c("Watershed","Stream","Species")]))*3/9
plyr::l_ply(1:npages, function(page){
plot.fit <- ggplot(data=cue.data, aes(x=Year, y=logCUE))+
ggtitle("Observed and fitted data from before/after model")+
geom_point(position=position_jitter(width=0.1))+
geom_line(data=ba.pred, aes(y=pred), color='blue', size=2)+
geom_line(data=mean.CUE, aes(y=mean.logCUE), color="red")+
geom_vline(data=year.restore, aes(xintercept=year.restore+.5))+
facet_grid_paginate(Watershed+Stream+Species~AgeClass, page=page, nrow=3, ncol=3)
plot(plot.fit)
})
This set of fits and estimated differences/ratio can be used to construct summary plots
# Get all of the fits and store them in a list with the emmeans computatons
ba.diffs <- plyr::ldply(ba.fits, function(x){x$BA.diff.log})
#head(ba.diffs)
# pretty up the output
temp<- ba.diffs
num.col <- sapply(temp, is.numeric)
temp[,num.col] <- round(temp[,num.col],2)
temp$t.ratio <- NULL
temp
## Watershed Stream Species AgeClass contrast estimate SE df lower.CL upper.CL p.value
## 1 Ath/NSask Mackenzie BKTR AllAges After - Before 1.45 0.34 5.03 0.58 2.31 0.01
## 2 Ath/NSask Mackenzie BLTR Immature After - Before 0.30 0.38 5.05 -0.68 1.29 0.46
## 3 Ath/NSask Mackenzie BLTR Mature After - Before 1.11 0.56 5.30 -0.31 2.52 0.10
## 4 Ath/NSask Mackenzie RNTR Immature After - Before 0.25 0.38 4.70 -0.75 1.25 0.54
## 5 Ath/NSask Mackenzie RNTR Mature After - Before -0.25 0.35 4.70 -1.16 0.66 0.50
## 6 Ath/NSask Moon BKTR AllAges After - Before -0.21 0.52 4.59 -1.59 1.17 0.70
## 7 Ath/NSask Moon BLTR Immature After - Before -0.08 0.53 2.74 -1.85 1.68 0.89
## 8 Ath/NSask Moon BLTR Mature After - Before 0.20 0.73 4.31 -1.77 2.17 0.80
## 9 Ath/NSask Moon RNTR Immature After - Before -0.27 0.74 2.74 -2.75 2.21 0.74
## 10 Ath/NSask Moon RNTR Mature After - Before 0.25 0.69 2.74 -2.05 2.56 0.74
## 11 Ath/NSask Thistle BKTR AllAges After - Before 0.03 0.46 2.52 -1.60 1.65 0.96
## 12 Ath/NSask Thistle BLTR Immature After - Before -0.65 0.74 3.18 -2.93 1.63 0.44
## 13 Ath/NSask Thistle BLTR Mature After - Before 0.62 0.65 3.08 -1.41 2.64 0.41
## 14 Clearwater Cutoff BKTR AllAges After - Before 0.00 0.37 7.56 -0.87 0.86 1.00
## 15 Clearwater Cutoff BLTR Immature After - Before -1.16 0.71 7.56 -2.81 0.49 0.14
## 16 Clearwater Cutoff BLTR Mature After - Before 1.19 0.67 7.56 -0.36 2.74 0.11
## 17 Clearwater Elk BKTR AllAges After - Before 0.01 0.53 9.76 -1.18 1.20 0.99
## 18 Clearwater Elk BLTR Immature After - Before -0.73 0.64 9.76 -2.16 0.71 0.29
## 19 Clearwater Elk BLTR Mature After - Before 0.65 0.67 7.85 -0.89 2.20 0.36
## 20 Clearwater Limestone BKTR AllAges After - Before 0.00 0.38 4.69 -0.99 0.99 1.00
## 21 Clearwater Limestone BLTR Immature After - Before -0.42 0.71 4.69 -2.28 1.44 0.58
## 22 Clearwater Limestone BLTR Mature After - Before 0.45 0.52 4.69 -0.92 1.82 0.43
## 23 Clearwater Rocky BKTR AllAges After - Before 2.32 0.32 6.32 1.54 3.10 0.00
## 24 Clearwater Rocky BLTR Immature After - Before 0.53 0.53 6.20 -0.76 1.82 0.35
## 25 Clearwater Rocky BLTR Mature After - Before 0.76 0.49 6.20 -0.44 1.96 0.17
## 26 Ram Fall BKTR AllAges After - Before 1.69 0.65 4.75 -0.01 3.40 0.05
## 27 Ram Fall BLTR Immature After - Before -0.28 0.88 5.55 -2.48 1.92 0.76
## 28 Ram Fall BLTR Mature After - Before 2.28 0.84 5.58 0.19 4.37 0.04
ggplot(data=ba.diffs, aes(y=paste(Watershed,".",Stream,sep=" "), x=estimate))+
ggtitle("Estimated difference (log scale) between before and after")+
geom_point()+
ggplot2::geom_errorbarh( aes(xmin=lower.CL, xmax=upper.CL), height=.01)+
xlab("Estimated difference (log-scale) and 95% ci")+
ylab("Watershed - Stream")+
geom_vline(xintercept=0)+
geom_text( label=format.p(ba.diffs$p.value,precision=.001), aes(x=-Inf), hjust=-.01, vjust=.5, size=2)+
facet_grid(Species~AgeClass)
A similar plot can be made for the ratio of the mean (not shown).
Case where the 95% confidence interval does not include the value of 0 (vertical line) indicate evidence of a difference. The actual p-value is presented in the ba.diffs data.frame constructed above if these are of interest.
The \(\mathit{arm}\) package can be used to generate samples from the posterior distribution of the effect size. Some care is needed because knowledge of the coding used by \(R\) is required – in particular, is the estimate of the fixed effect the difference of means computed as \(\mathit{mean}_\mathit{before} - \mathit{mean}_\mathit{after}\) or vice-versa.
# sample from the posterior, plot the violin plots and compute the posterior belief that after-before >0
ba.diffs.post <- plyr::ldply(ba.fits, function(x){
#browser()
post.samples <- arm::sim(x$fit, n.sims=n.sims.post)
data.frame(post.diff.A.minus.B = -post.samples@fixef[,2]) # notice that R computes before-after
})
head(ba.diffs.post)
## Watershed Stream Species AgeClass post.diff.A.minus.B
## 1 Ath/NSask Mackenzie BKTR AllAges 1.723958
## 2 Ath/NSask Mackenzie BKTR AllAges 1.199520
## 3 Ath/NSask Mackenzie BKTR AllAges 1.026186
## 4 Ath/NSask Mackenzie BKTR AllAges 1.528154
## 5 Ath/NSask Mackenzie BKTR AllAges 1.016468
## 6 Ath/NSask Mackenzie BKTR AllAges 1.964981
# compute the posterior belief that slope > 0
ba.diffs.belief <- plyr::ddply(ba.diffs.post, c("Watershed","Stream","Species","AgeClass"),
plyr::summarize,
post.belief.gt.0 = mean(post.diff.A.minus.B >0))
head(ba.diffs.belief)
## Watershed Stream Species AgeClass post.belief.gt.0
## 1 Ath/NSask Mackenzie BKTR AllAges 1.000
## 2 Ath/NSask Mackenzie BLTR Immature 0.797
## 3 Ath/NSask Mackenzie BLTR Mature 0.973
## 4 Ath/NSask Mackenzie RNTR Immature 0.746
## 5 Ath/NSask Mackenzie RNTR Mature 0.235
## 6 Ath/NSask Moon BKTR AllAges 0.353
ggplot(data=ba.diffs.post, aes(y=paste(Watershed,".",Stream,sep=" "), x=post.diff.A.minus.B))+
ggtitle("Posterior sample for difference (log scale) in After - Before mean")+
ggstance::geom_violinh()+
geom_text(data=ba.diffs.belief, label=formatC(ba.diffs.belief$post.belief.gt.0,2, format="f"),
aes(x=-Inf), hjust=-.2) +
xlab("Difference (log-scale) (After-Before)\nWith posterior belief that difference > 0")+
ylab("Watershed - Stream")+
geom_vline(xintercept=0)+
facet_grid(Species~AgeClass)
The posterior belief that the difference in means (log-scale; After-Before) is positive has a nice interpretation.
It is always important to do a model assessment to ensure that the assumptions of the model are satisfied. In this analysis the key assumptions are
It is possible to extract the estimates of the two sets of random effects and create residual and caterpillar plots.
# The dianostic plot is produced for the first fit as a demonstration
plyr::l_ply(ba.fits[2], function(x){
diag.plot <- sf.autoplot.lmer(x$fit)
plot(diag.plot)
})
The residual plot (upper left corner) should show random scatter around 0. There are only two columns to the residuals because there are only two fitted values, the mean before restoration took place and the mean after restoration took place.
The normal probability plot (upper right) for the residuals (the site-effects) should fall close to the theoretical line of fit. It is not crucial that these lowest random effects have an exact normal distribution since they are effectively averaged in the analysis and the central-limit theorem indicates that the average of any distribution will tend towards normality as the sample size increases.
The caterpillar plot (bottom left) should show no large outlier effects for specific years. In this case, the estimated standard deviation of the year-effects is 0, so all of the points (with 95% credible interval) are the same at 0.
It will be extremely difficult to detect autocorrelation over time unless you have 10+ years of data (Bence 1995) and so this is not done here.
As noted earlier, the presented analyses assume that a different set of sites is measured in each stream-year combination. However, all else being equal, there is some advantage to measuring the same site repeated over time. In this case, the above model must be modified to account for the site-specific effects. We assume that the \(Site\) variable has a unique alphanumeric code representing a particular site. The model is adjusted by including another term \((1|\mathit{Site})\) that accounts for the same site being repeatedly measured over time. Not all sites need to be repeatedly measured, and a site does not have to be repeatedly measured in all years. However, at least one site should have been measured in more than one year.
We then modify the previous fitting code to account for some with sites with multiple measurements by adding the (1|Site) term to the model. The rest of the code can then be run without any other changes.
# Get all of the fits and store them in a list with the emmeans computatons
# Account for repeated measurements on the same stream
ba.fits.rep <- plyr::dlply(cue.data, c("Watershed","Stream","Species","AgeClass"), function(x){
#browser()
x$YearF <- factor(x$Year)
# If we use the log() in the fit, the emmeans package can do the back transform automatically for us
fit <- lmerTest::lmer(log(CUE+offset) ~ BA + (1|YearF) +(1|Site), data=x)
# we create the emmeans object and then find the pairwise difference
fit.emmo <- emmeans::emmeans(fit, ~BA)
BA.diff.log <- summary(pairs(fit.emmo), infer=TRUE)
BA.ratio <- summary(pairs(fit.emmo, type="response"))
list(fit=fit,
BA.diff.log=BA.diff.log,
BA.ratio=BA.ratio)
})
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
# Extract one of the fitting results for closer examination
# In particular, look at the variance component associated with permanent Sites
ba.fits.rep[[1]]$fit
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: log(CUE + offset) ~ BA + (1 | YearF) + (1 | Site)
## Data: x
## REML criterion at convergence: 368.7503
## Random effects:
## Groups Name Std.Dev.
## Site (Intercept) 0.1238
## YearF (Intercept) 0.3024
## Residual 1.2809
## Number of obs: 109, groups: Site, 90; YearF, 8
## Fixed Effects:
## (Intercept) BABefore
## -1.250 -1.442
ba.fits.rep[[1]]$BA.diff.log
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## After - Before 1.44 0.345 5.03 0.556 2.33 4.179 0.0086
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
ba.fits.rep[[1]]$BA.ratio
## contrast ratio SE df t.ratio p.value
## After / Before 4.23 1.46 5.03 4.179 0.0086
##
## Tests are performed on the log scale
# the rest of the code has no other modifications
# Get all of the fits and store them in a list with the emmeans computatons
ba.diffs.rep <- plyr::ldply(ba.fits.rep, function(x){x$BA.diff.log})
#head(ba.diffs)
# pretty up the output
temp<- ba.diffs.rep
num.col <- sapply(temp, is.numeric)
temp[,num.col] <- round(temp[,num.col],2)
temp$t.ratio <- NULL
temp
## Watershed Stream Species AgeClass contrast estimate SE df lower.CL upper.CL p.value
## 1 Ath/NSask Mackenzie BKTR AllAges After - Before 1.44 0.35 5.03 0.56 2.33 0.01
## 2 Ath/NSask Mackenzie BLTR Immature After - Before 0.32 0.40 5.29 -0.70 1.33 0.46
## 3 Ath/NSask Mackenzie BLTR Mature After - Before 1.11 0.57 5.29 -0.32 2.54 0.10
## 4 Ath/NSask Mackenzie RNTR Immature After - Before 0.18 0.39 5.16 -0.82 1.18 0.66
## 5 Ath/NSask Mackenzie RNTR Mature After - Before -0.25 0.36 4.65 -1.19 0.69 0.51
## 6 Ath/NSask Moon BKTR AllAges After - Before 0.00 0.57 4.23 -1.54 1.54 1.00
## 7 Ath/NSask Moon BLTR Immature After - Before -0.08 0.56 2.57 -2.06 1.89 0.89
## 8 Ath/NSask Moon BLTR Mature After - Before 0.20 0.78 3.94 -1.99 2.38 0.81
## 9 Ath/NSask Moon RNTR Immature After - Before -0.27 0.79 2.57 -3.04 2.49 0.75
## 10 Ath/NSask Moon RNTR Mature After - Before 0.25 0.73 2.57 -2.32 2.83 0.76
## 11 Ath/NSask Thistle BKTR AllAges After - Before 0.03 0.49 2.44 -1.77 1.82 0.96
## 12 Ath/NSask Thistle BLTR Immature After - Before -0.13 0.80 3.87 -2.39 2.13 0.88
## 13 Ath/NSask Thistle BLTR Mature After - Before 0.61 0.70 3.29 -1.50 2.73 0.44
## 14 Clearwater Cutoff BKTR AllAges After - Before 0.16 0.37 11.92 -0.65 0.96 0.68
## 15 Clearwater Cutoff BLTR Immature After - Before -0.59 0.90 7.88 -2.68 1.50 0.53
## 16 Clearwater Cutoff BLTR Mature After - Before 1.16 0.68 7.70 -0.41 2.74 0.13
## 17 Clearwater Elk BKTR AllAges After - Before 0.01 0.62 3.00 -1.97 1.99 0.99
## 18 Clearwater Elk BLTR Immature After - Before -0.54 0.90 10.60 -2.53 1.45 0.56
## 19 Clearwater Elk BLTR Mature After - Before 0.65 0.75 3.88 -1.46 2.76 0.44
## 20 Clearwater Limestone BKTR AllAges After - Before 0.00 0.39 4.59 -1.02 1.02 1.00
## 21 Clearwater Limestone BLTR Immature After - Before -0.42 0.73 4.59 -2.34 1.50 0.59
## 22 Clearwater Limestone BLTR Mature After - Before 0.46 0.54 4.74 -0.94 1.86 0.43
## 23 Clearwater Rocky BKTR AllAges After - Before 2.33 0.33 6.34 1.54 3.11 0.00
## 24 Clearwater Rocky BLTR Immature After - Before 0.52 0.55 6.37 -0.81 1.85 0.38
## 25 Clearwater Rocky BLTR Mature After - Before 0.76 0.50 6.20 -0.45 1.97 0.18
## 26 Ram Fall BKTR AllAges After - Before 1.69 0.66 4.75 -0.04 3.42 0.05
## 27 Ram Fall BLTR Immature After - Before -0.28 0.89 5.55 -2.50 1.94 0.77
## 28 Ram Fall BLTR Mature After - Before 2.28 0.85 5.58 0.17 4.39 0.04
ggplot(data=ba.diffs.rep, aes(y=paste(Watershed,".",Stream,sep=" "), x=estimate))+
ggtitle("Estimated difference (log scale) between before and after",
subtitle="Repeated measurements on the same site accounted for")+
geom_point()+
ggplot2::geom_errorbarh( aes(xmin=lower.CL, xmax=upper.CL), height=.01)+
xlab("Estimated difference (log-scale) and 95% ci")+
ylab("Watershed - Stream")+
geom_vline(xintercept=0)+
facet_grid(Species~AgeClass)+
geom_text( label=format.p(ba.diffs.rep$p.value,precision=.001), aes(x=-Inf), hjust=-.01, vjust=.5, size=2)
The before/after analysis assumes a step-change in the mean between the before and after restoration activities. In some cases, the effect of restoration can be more gradual with incremental improvements over time.
In this section, we will fit a broken-stick trend where the slopes of the trends could be different between the two periods, but the lines are joined together at the point of restoration. For example, here is sample plot illustrating the line we wish to fit:
Interest lies if the slope changes between the pre- and post-restoration activity.
The statistical model must include a mechanism to create the broken-stick and to account for year-specific effect (process error). The model is: \[log(\mathit{CUE}+\textit{offset}) \sim Year + YearP + \mathit{YearF}(R)\] where \(\mathit{Year}\) represents the yearly trend before restoration takes place; \(\mathit{YearP}\) represents the change in trend after restoration takes place; and \(\mathit{YearF}\) again represents the (random) year-specific effects (process error).
The data structure includes a new variable \(\mathit{YearP}\) which is 0 for years prior to when restoration takes place, and then takes the value of \(\mathit{Year}-\mathit{Year~of~restoration}\). For example, if restoration took place in 2015, the new \(\mathit{YearP}\) variable is computed (in \(R\)) as data$YearP = pmax(0,data$Year - 2015).
The \(\mathit{lmerTest}\) package is used to fit the model; the \(\mathit{emmeans}\) package is used to estimate the period effect after the model is fit.
# Get all of the fits and store them in a list with the emmeans computatons
bs.fits <- plyr::dlply(cue.data, c("Watershed","Stream","Species","AgeClass"), function(x){
#browser()
x$YearF <- factor(x$Year)
# find the year of restoration
restore.year <- max(x$Year[ substr(x$BA,1,1)=="B"])
# compute the YearP variable
x$YearP <- pmax(0, x$Year - restore.year)
# If we use the log() in the fit, the emmeans package can do the back transform automatically for us
fit <- lmerTest::lmer(log(CUE+offset) ~ Year + YearP + (1|YearF), data=x)
# extract the estimates of the two slopes and the change in the slope
# also get the confidence intervas
ci <- confint(fit, "Year", quiet=TRUE, method="Wald")
init.slope <- cbind(summary(fit)$coefficients[ row.names(summary(fit)$coefficients)=="Year",,drop=FALSE], ci)
#browser()
ci <- confint(fit, "YearP", quiet=TRUE, method="Wald")
delta.slope<- cbind(summary(fit)$coefficients[ row.names(summary(fit)$coefficients)=="YearP",,drop=FALSE], ci)
# compute the final slope by fitting the same model except in reverse
x$YearP2 <- pmax(0, restore.year-x$Year)
fit2 <- lmerTest::lmer(log(CUE+offset) ~ Year + YearP2 + (1|YearF), data=x)
ci <- confint(fit2, "Year", quiet=TRUE, method="Wald")
final.slope<-cbind(summary(fit2)$coefficients[ row.names(summary(fit2)$coefficients)=="Year",,drop=FALSE], ci)
# fix the names for each
colnames(init.slope) <- make.names(colnames(init.slope))
colnames(delta.slope)<- make.names(colnames(delta.slope))
colnames(final.slope)<- make.names(colnames(final.slope))
list(Watershed=x$Watershed[1], Stream=x$Stream[1], Species=x$Species[1], AgeClass=x$AgeClass[1],
fit=fit,
init.slope=init.slope,
delta.slope=delta.slope,
final.slope=final.slope)
})
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
# Show the results for one of the fits
bs.fits[[1]]$fit
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: log(CUE + offset) ~ Year + YearP + (1 | YearF)
## Data: x
## REML criterion at convergence: 377.4876
## Random effects:
## Groups Name Std.Dev.
## YearF (Intercept) 0.6162
## Residual 1.2869
## Number of obs: 109, groups: YearF, 8
## Fixed Effects:
## (Intercept) Year YearP
## -159.11041 0.07775 0.22263
bs.fits[[1]]$init.slope
## Estimate Std..Error df t.value Pr...t.. X2.5.. X97.5..
## Year 0.07774514 0.1183609 4.236481 0.6568483 0.545256 -0.1542379 0.3097282
bs.fits[[1]]$delta.slope
## Estimate Std..Error df t.value Pr...t.. X2.5.. X97.5..
## YearP 0.2226252 0.2437155 5.237771 0.9134637 0.4010908 -0.2550483 0.7002988
bs.fits[[1]]$final.slope
## Estimate Std..Error df t.value Pr...t.. X2.5.. X97.5..
## Year 0.3003704 0.1666102 6.016491 1.802833 0.1213397 -0.02617955 0.6269203
A message about a singular fit indicates that the estimated standard deviation of the year random effect is 0. The estimated change in the slope between the after and before periods, along with standard errors, a 95% confidence limit, and p-value of the test for no change in the slope is produced.
We can examine the plots of fitted and predicted values to check the general form of the fit and again for any obvious problems:
# Get the fitted values
bs.pred <- plyr::ldply(bs.fits, function(x){
#browser()
# need to convert the YearF from factor back to numeric value
pred <- cbind( x$fit@frame, pred=predict(x$fit, re.form=~0))
pred$Year <- as.numeric(as.character(pred$YearF))
pred
})
# Now to generate the observed and fitted data for each Watershed/Stream combination
npages=nrow(unique(cue.data[,c("Watershed","Stream","Species")]))*3/9
plyr::l_ply(1:npages, function(page){
plot.fit <- ggplot(data=cue.data, aes(x=Year, y=logCUE))+
ggtitle("Observed and fitted data from broken stick model")+
geom_point(position=position_jitter(width=0.1))+
geom_line(data=bs.pred, aes(y=pred), color='blue', size=2)+
geom_line(data=mean.CUE, aes(y=mean.logCUE), color="red")+
geom_vline(data=year.restore, aes(xintercept=year.restore+.5))+
facet_grid_paginate(Watershed+Stream+Species~AgeClass, page=page, nrow=3, ncol=3)
plot(plot.fit)
})
This set of fits and estimated differences/ratio can be used to construct summary plots
# Get all of the fits and store them in a list with the emmeans computatons
bs.diffs <- plyr::ldply(bs.fits, function(x){x$delta.slope})
#head(bs.diffs)
# pretty up the output
temp<- bs.diffs
num.col <- sapply(temp, is.numeric)
temp[,num.col] <- round(temp[,num.col],2)
temp$t.value <- NULL
temp
## Watershed Stream Species AgeClass Estimate Std..Error df Pr...t.. X2.5.. X97.5..
## 1 Ath/NSask Mackenzie BKTR AllAges 0.22 0.24 5.24 0.40 -0.26 0.70
## 2 Ath/NSask Mackenzie BLTR Immature 0.25 0.17 4.72 0.20 -0.08 0.58
## 3 Ath/NSask Mackenzie BLTR Mature 0.49 0.15 3.20 0.04 0.19 0.78
## 4 Ath/NSask Mackenzie RNTR Immature 0.03 0.18 106.00 0.88 -0.33 0.38
## 5 Ath/NSask Mackenzie RNTR Mature -0.04 0.17 106.00 0.81 -0.36 0.29
## 6 Ath/NSask Moon BKTR AllAges -0.05 0.21 4.57 0.82 -0.45 0.36
## 7 Ath/NSask Moon BLTR Immature 0.13 0.17 42.00 0.46 -0.21 0.46
## 8 Ath/NSask Moon BLTR Mature -0.13 0.26 1.96 0.67 -0.64 0.38
## 9 Ath/NSask Moon RNTR Immature -0.03 0.27 4.25 0.92 -0.55 0.50
## 10 Ath/NSask Moon RNTR Mature 0.07 0.23 42.00 0.77 -0.38 0.51
## 11 Ath/NSask Thistle BKTR AllAges 0.01 0.64 31.00 0.99 -1.24 1.25
## 12 Ath/NSask Thistle BLTR Immature 0.78 0.75 0.82 0.52 -0.69 2.24
## 13 Ath/NSask Thistle BLTR Mature -0.71 0.67 31.00 0.29 -2.02 0.59
## 14 Clearwater Cutoff BKTR AllAges 0.00 0.16 34.00 0.99 -0.31 0.31
## 15 Clearwater Cutoff BLTR Immature 0.40 0.29 34.00 0.18 -0.18 0.97
## 16 Clearwater Cutoff BLTR Mature -0.37 0.28 34.00 0.18 -0.91 0.16
## 17 Clearwater Elk BKTR AllAges 0.01 0.21 39.00 0.95 -0.40 0.43
## 18 Clearwater Elk BLTR Immature 0.08 0.26 39.00 0.77 -0.42 0.58
## 19 Clearwater Elk BLTR Mature -0.01 0.30 5.15 0.97 -0.60 0.58
## 20 Clearwater Limestone BKTR AllAges 0.00 0.17 26.00 1.00 -0.33 0.33
## 21 Clearwater Limestone BLTR Immature -0.17 0.34 4.63 0.65 -0.83 0.50
## 22 Clearwater Limestone BLTR Mature 0.15 0.23 26.00 0.53 -0.31 0.61
## 23 Clearwater Rocky BKTR AllAges 0.28 0.28 5.32 0.36 -0.26 0.82
## 24 Clearwater Rocky BLTR Immature 0.47 0.21 83.00 0.03 0.06 0.87
## 25 Clearwater Rocky BLTR Mature 0.21 0.23 5.17 0.40 -0.24 0.67
## 26 Ram Fall BKTR AllAges 0.33 0.31 5.88 0.33 -0.28 0.94
## 27 Ram Fall BLTR Immature -0.18 0.35 5.16 0.63 -0.86 0.51
## 28 Ram Fall BLTR Mature 0.82 0.29 4.51 0.04 0.24 1.39
ggplot(data=bs.diffs, aes(y=paste(Watershed,".",Stream,sep=" "), x=Estimate))+
ggtitle("Estimated change in slope (log scale) between before and after")+
geom_point()+
ggplot2::geom_errorbarh( aes(xmin=X2.5.., xmax=X97.5..), height=.01)+
xlab("Estimated change in slope (log-scale) and 95% ci")+
ylab("Watershed - Stream")+
geom_vline(xintercept=0)+
facet_grid(Species~AgeClass)+
geom_text( label=format.p(bs.diffs$Pr...t.., precision=.001), aes(x=-Inf), hjust=-.01, vjust=.5, size=2)
Case where the 95% confidence interval does not include the value of 0 (vertical line) indicate evidence of a difference in the slope. The actual p-value is presented in the bs.diffs data.frame constructed above if these are of interest.
The \(\mathit{arm}\) package can be used to generate samples from the posterior distribution of the effect size. Here we can access the estimated difference in the slopes directly.
# sample from the posterior, plot the violin plots and compute the posterior belief that after-before >0
bs.diffs.post <- plyr::ldply(bs.fits, function(x){
#browser()
#cat(x$Watershed, x$Stream, x$Species, x$AgeClass,"\n") # use this to follow progress of the sampling
#if(x$Stream=="Thistle")browser()
post.samples <- arm::sim(x$fit, n.sims=n.sims.post)
data.frame(post.diff.slope = post.samples@fixef[,"YearP"]) # Year P is the difference in slope (after-before)
})
head(bs.diffs.post)
## Watershed Stream Species AgeClass post.diff.slope
## 1 Ath/NSask Mackenzie BKTR AllAges 0.46992975
## 2 Ath/NSask Mackenzie BKTR AllAges 0.03219299
## 3 Ath/NSask Mackenzie BKTR AllAges 0.19133428
## 4 Ath/NSask Mackenzie BKTR AllAges 0.35384744
## 5 Ath/NSask Mackenzie BKTR AllAges 0.29144557
## 6 Ath/NSask Mackenzie BKTR AllAges 0.27428480
# compute the posterior belief that difference (A-B) > 0
bs.diffs.belief <- plyr::ddply(bs.diffs.post, c("Watershed","Stream","Species","AgeClass"),
plyr::summarize,
post.belief.gt.0 = mean(post.diff.slope >0))
head(bs.diffs.belief)
## Watershed Stream Species AgeClass post.belief.gt.0
## 1 Ath/NSask Mackenzie BKTR AllAges 0.807
## 2 Ath/NSask Mackenzie BLTR Immature 0.937
## 3 Ath/NSask Mackenzie BLTR Mature 0.999
## 4 Ath/NSask Mackenzie RNTR Immature 0.553
## 5 Ath/NSask Mackenzie RNTR Mature 0.422
## 6 Ath/NSask Moon BKTR AllAges 0.403
ggplot(data=bs.diffs.post, aes(y=paste(Watershed,".",Stream,sep=" "), x=post.diff.slope))+
ggtitle("Posterior sample for difference (log scale) in the slope (After - Before)")+
ggstance::geom_violinh()+
geom_text(data=bs.diffs.belief, label=formatC(bs.diffs.belief$post.belief.gt.0,2, format="f"),
aes(x=-Inf), hjust=-.2) +
xlab("Change in slope (log-scale) (After-Before)\nWith posterior belief that difference > 0")+
ylab("Watershed - Stream")+
geom_vline(xintercept=0)+
facet_grid(Species~AgeClass)
The posterior belief that the difference in slopes (After-Before) is positive has a simple interpretation.
It is always important to do a model assessment to ensure that the assumptions of the model are satisfied. In this analysis the key assumptions are
It is possible to extract the estimates of the two sets of random effects and create residual and caterpillar plots.
# The dianostic plot is produced for one of the fits as a demonstration
plyr::l_ply(bs.fits[2], function(x){
diag.plot <- sf.autoplot.lmer(x$fit)
plot(diag.plot)
})
The diagnostic plot is interpreted in the same way as previously described. There are now more fitted values because of the broken-stick line makes a prediction for each year of the study.
The problem with a before/after analysis is that the observed change could be confounded with temporal effects. A Before-After-Control-Impact design uses one or more control streams to adjust for temporal effects.
In this example, the Fall stream in the Ram watershed does not have any control streams and so will be excluded from this analysis.
The key advantage of several control streams is that the scope of inference is now expanded. If there was only a single control stream, then one could argue that the effects observed are peculiar to the control stream chosen. If multiple control streams are (randomly) chosen, then one can argue that the effects are generalizable to all control streams in the watershed.
Each watershed only has one treatment stream, so it can still be argued that the results are peculiar to the treatment stream chosen. For the same reasons, it is also possible to have multiple treatment streams within a watershed.
The methods presented below will work if there are multiple control and/or multiple treatment streams but there must be replication in at least one of the groups. If a watershed only has a single treatment and a single control stream, the model below will still be applicable by removing the \(\mathit{Stream}\) random effects.
The final result from a BACI analysis is the differential change in the mean of treatment stream relative to the mean of the control stream(s). For example, suppose that the mean in the control streams changes from 0 to 1 between the before and after periods and the mean in the treatment streams changes from 3 to 6. The BACI effect is the differential change, i.e. \((6-3)=3\) vs. \((1-0)=1\) or \((6-3)-(1-0)=2\). For an analysis on the \(log(\mathit{CUE})\) the BACI effect is difference in the mean logarithm which can be translated into ratios of ratios.
As in the before/after analysis, the analysis can be done in two ways.
Again, if the design were exactly balanced, the two approaches would give exactly the same results.
The statistical model for each watershed is \[log(\mathit{CUE}+\textit{offset}) \sim \mathit{BA} + \mathit{CI} + \mathit{BA}:\mathit{CI} + \mathit{YearF}(R) + \mathit{Stream}(R)+ \mathit{YearF}:\mathit{Stream}(R)\] where the \(\mathit{BA}\) refers to the before/after effect; the \(\mathit{CI}\) refers to the control/impact effect; the \(\mathit{BA}:\mathit{CI}\) refers to the BACI effect (the differential change) which is the term of interest; the \(\mathit{YearF}\) refers to the year random effects (it must be declared as a factor to avoid being confused with a linear trend); the \(\mathit{Stream}\) term refers to the (random) differences among streams; and the \(\mathit{YearF}:\mathit{Stream}\) refers to the random effect where streams do not track each other in parallel over time.
The effects of stream will cancel in the analysis because each stream is measured before and after impact, i.e., each stream serves as a blocking variable. The effects of year will also cancel in the analysis because in each year, both control and impact streams are measured, i.e., each year also serves as a blocking variable. The last term represents inconsistency in the mean of a stream over time compared to other streams, i.e., a non-parallelism effect and is often the limiting factor in a power analysis in a BACI design.
The same model can be used if some streams are not measured in some years and if the design were unbalanced.
We exclude the watersheds that don’t have a control and treatment stream and a replicate of either the control or treatment streams.
n.streams <- plyr::ddply(cue.data, "Watershed", plyr::summarize,
n.control =length(unique(Stream[ CI=="C"])),
n.treatment=length(unique(Stream[ CI=="T"])))
n.streams
## Watershed n.control n.treatment
## 1 Ath/NSask 2 1
## 2 Clearwater 3 1
## 3 Ram 0 1
watershed.exclude <- n.streams[ n.streams$n.control==0 | n.streams$n.treatment==0,]
watershed.exclude
## Watershed n.control n.treatment
## 3 Ram 0 1
dim(cue.data)
## [1] 1577 11
cue.data.red <- cue.data[ !cue.data$Watershed %in% watershed.exclude$Watershed, ]
dim(cue.data.red)
## [1] 1454 11
The \(\mathit{lmerTest}\) package is used to fit the model; the \(\mathit{emmeans}\) package is used to estimate the BACI effect after the model is fit. Some care is needed to specify the coefficients of the BACI contrast properly. For example, in this model, the interaction (BACI) term is specified as \(\mathit{BA}:\mathit{CI}\). The \(\mathit{BA}\) variable is coded using After and Before; the \(\mathit{CI}\) variable is coded using C and I. Consequently, the BACI terms are in the order After C, Before C,After T, and Before T.
We would like the BACI contrast to be computed as the difference of difference in the means as \[\mathit{BACI} = (\mu_{\mathit{After~T}} - \mu_{\mathit{Before~T}}) - (\mu_{\mathit{After~C}} - \mu_{\mathit{Before~C}})\] and so the BACI coefficients are \((-1,1,1,-1)\) as specified below.
# Get all of the fits and store them in a list with the emmeans computatons
baci.fits <- plyr::dlply(cue.data.red, c("Watershed","Species","AgeClass"), function(x){
#browser()
x$YearF <- factor(x$Year)
# If we use the log() in the fit, the emmeans package can do the back transform automatically for us
fit <- lmerTest::lmer(log(CUE+offset) ~ BA + CI + BA:CI + (1|YearF) +(1|Stream) +(1|YearF:Stream), data=x)
# we create the emmeans object and then find the pairwise difference
fit.emmo <- emmeans::emmeans(fit, ~BA:CI)
BACI.diff.log <- summary(contrast(fit.emmo,list(baci=c(-1,1,1,-1))), infer=TRUE)
BACI.ratio <- summary(contrast(fit.emmo,list(baci=c(-1,1,1,-1)), type="response"))
list(fit=fit,
BACI.diff.log=BACI.diff.log,
BACI.ratio=BACI.ratio)
})
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
# Show the results for the first fit
baci.fits[[1]]$fit
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: log(CUE + offset) ~ BA + CI + BA:CI + (1 | YearF) + (1 | Stream) + (1 | YearF:Stream)
## Data: x
## REML criterion at convergence: 628.5947
## Random effects:
## Groups Name Std.Dev.
## YearF:Stream (Intercept) 0.00000
## YearF (Intercept) 0.29185
## Stream (Intercept) 0.09358
## Residual 1.26406
## Number of obs: 188, groups: YearF:Stream, 25; YearF, 11; Stream, 3
## Fixed Effects:
## (Intercept) BABefore CIT BABefore:CIT
## -2.65555 0.09172 1.40273 -1.50980
## convergence code 0; 1 optimizer warnings; 0 lme4 warnings
baci.fits[[1]]$BACI.diff.log
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## baci 1.51 0.422 6.77 0.505 2.51 3.578 0.0095
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
baci.fits[[1]]$BACI.ratio
## contrast ratio SE df t.ratio p.value
## baci 4.53 1.91 6.77 3.578 0.0095
##
## Tests are performed on the log scale
A message about a singular fit indicates that the estimated standard deviation of some of the random effects is 0. The estimated BACI effect along with standard errors, a 95% confidence limit, and p-value of the test for no effect is produced.
We can examine the plots of fitted and predicted values to check the general form of the fit and again for any obvious problems:
# Get the fitted values
baci.pred <- plyr::ldply(baci.fits, function(x){
#browser()
# need to convert the YearF from factor back to numeric value
pred <- cbind( x$fit@frame, pred.global=predict(x$fit, re.form=~0))
pred$pred.stream =predict(x$fit, re.form=~(1|Stream))
pred$Year <- as.numeric(as.character(pred$YearF))
pred
})
# Now to generate the observed and fitted data for each Watershed/Stream combination
npages=2 # only 2 watersheds present
plyr::l_ply(1:npages, function(page){
plot.fit <- ggplot(data=cue.data.red, aes(x=Year, y=logCUE))+
ggtitle("Observed and fitted data from BACI in each watershed")+
#geom_point(position=position_jitter(width=0.1), size=.1)+ too buzy
geom_line(data=baci.pred, aes(y=pred.global, color=CI), size=2)+
geom_line(data=mean.CUE[ !mean.CUE$Watershed %in% watershed.exclude$Watershed,], aes(y=mean.logCUE, linetype=Stream))+
geom_vline(data=year.restore[ !year.restore$Watershed %in% watershed.exclude$Watershed,], aes(xintercept=year.restore+.5))+
facet_grid_paginate(Watershed+Species~AgeClass, page=page, nrow=3, ncol=3)
plot(plot.fit)
})
This set of fits and estimated BACI effects can be used to construct summary plots
# Get all of the fits and store them in a list with the emmeans computatons
baci.effs <- plyr::ldply(baci.fits, function(x){x$BACI.diff.log})
#head(baci.effs)
#pretty up the output
temp<- baci.effs
num.col <- sapply(temp, is.numeric)
temp[,num.col] <- round(temp[,num.col],2)
temp$t.ratio <- NULL
temp
## Watershed Species AgeClass contrast estimate SE df lower.CL upper.CL p.value
## 1 Ath/NSask BKTR AllAges baci 1.51 0.42 6.77 0.51 2.51 0.01
## 2 Ath/NSask BLTR Immature baci 0.58 0.47 6.18 -0.57 1.73 0.27
## 3 Ath/NSask BLTR Mature baci 0.64 0.63 7.39 -0.84 2.11 0.34
## 4 Ath/NSask RNTR Immature baci 0.53 0.81 6.44 -1.42 2.47 0.54
## 5 Ath/NSask RNTR Mature baci -0.51 0.74 6.44 -2.28 1.27 0.52
## 6 Clearwater BKTR AllAges baci 2.30 0.41 16.09 1.42 3.17 0.00
## 7 Clearwater BLTR Immature baci 1.24 0.63 14.39 -0.11 2.59 0.07
## 8 Clearwater BLTR Mature baci 0.05 0.60 14.41 -1.23 1.33 0.94
ggplot(data=baci.effs, aes(y=Watershed, x=estimate))+
ggtitle("Estimated BACI effects (log scale)",
subtitle="BACI computed as (AT-BT)-(AC-BC)")+
geom_point()+
ggplot2::geom_errorbarh( aes(xmin=lower.CL, xmax=upper.CL), height=.01)+
xlab("Estimated BACI effect (log-scale) and 95% ci and p-value")+
ylab("Watershed")+
geom_vline(xintercept=0)+
geom_text( label=format.p(baci.effs$p.value, precision=.0001), aes(x=-Inf), hjust=-.01, vjust=1.5)+
facet_grid(Species~AgeClass)
A similar plot can be made for the BACI ratio (not shown).
Case where the 95% confidence interval does not include the value of 0 (vertical line) indicate evidence of a BACI effect. The actual p-value is also shown on the plot
We again sample from the posterior distribution of the parameters using the arm package. Some care is needed to ensure that the BACI comparison is computed in the correct direction. Notice here that we must take the negative of the BACI interaction term.
# sample from the posterior for the BACI effect, plot the violin plots and compute the posterior belief that BACI effect >0
# Again, need to be careful that the sign of the contrast for the BACI effect is correct.
baci.post <- plyr::ldply(baci.fits, function(x){
#browser()
post.samples <- arm::sim(x$fit, n.sims=n.sims.post)
data.frame(post.baci = -post.samples@fixef[,"BABefore:CIT"]) # notice that R computes before-after
})
head(baci.post)
## Watershed Species AgeClass post.baci
## 1 Ath/NSask BKTR AllAges 1.794839
## 2 Ath/NSask BKTR AllAges 1.983587
## 3 Ath/NSask BKTR AllAges 1.599511
## 4 Ath/NSask BKTR AllAges 0.717109
## 5 Ath/NSask BKTR AllAges 1.066841
## 6 Ath/NSask BKTR AllAges 1.658148
# compute the posterior belief that difference (A-B) > 0
baci.belief <- plyr::ddply(baci.post, c("Watershed","Species","AgeClass"),
plyr::summarize,
post.belief.gt.0 = mean(post.baci >0))
head(baci.belief)
## Watershed Species AgeClass post.belief.gt.0
## 1 Ath/NSask BKTR AllAges 1.000
## 2 Ath/NSask BLTR Immature 0.908
## 3 Ath/NSask BLTR Mature 0.867
## 4 Ath/NSask RNTR Immature 0.792
## 5 Ath/NSask RNTR Mature 0.235
## 6 Clearwater BKTR AllAges 1.000
ggplot(data=baci.post, aes(y=Watershed, x=post.baci))+
ggtitle("Posterior sample for BACI effect (log scale)",
subtitle="BACI computed as (AT-BT)-(AC-BC)")+
ggstance::geom_violinh()+
geom_text(data=baci.belief, label=formatC(baci.belief$post.belief.gt.0,2, format="f"),
aes(x=-Inf), hjust=-.2, vjust=1.5) +
xlab("BACI effect (log-scale) \nWith posterior belief that difference > 0")+
ylab("Watershed ")+
geom_vline(xintercept=0)+
facet_grid(Species~AgeClass)
It is always important to do a model assessment to ensure that the assumptions of the model are satisfied. In this analysis the key assumptions are
It is possible to extract the estimates of the sets of random effects and create residual and caterpillar plots.
# The dianostic plot is produced for the first fit as a demonstration
plyr::l_ply(baci.fits[1], function(x){
#browser()
diag.plot <- sf.autoplot.lmer(x$fit)
plot(diag.plot)
})
The residual plot (upper left corner) should show random scatter around 0. In this plot, we notice what appears to be a “line” of residuals. This is an artefact of the data because it is not possible to have a CUE that is negative and an offset is added to all CUE values before taking logarithms. For example, consider an observed value of CUE of 0 where the offset was .0067 (2 fish/300 m). The observed value of \(log(\mathit{CUE}+\textit{offset})=-5\). If the predicted value on the logarithmic value was \(-5\), then the residual is 0; If the predicted value on the logarithmic scale was \(-4\), then the residual is -1, etc. As noted previously, the distribution of the residuals at the lowest level is not that crucial because measurements at sites are essentially being averaged for each year.
The normal probability plot (upper right) for the residuals (the site-effects) should fall close to the theoretical line of fit. It is not crucial that these lowest random effects have an exact normal distribution since they are effectively averaged in the analysis and the central-limit theorem indicates that the average of any distribution will tend towards normality as the sample size increases.
The caterpillar plots should show no large outlier effects for specific years. In this case, the estimated standard deviation of the random effects is 0, so all of the points (with 95% credible interval) are the same at 0.
It will be extremely difficult to detect autocorrelation over time unless you have 10+ years of data (Bence 1995) and so this is not done here.
The BACI analysis in the previous section assumes a step change in the mean between the before and after periods. But often the effect of restoration increases over time and so a BACI analysis on the slopes may be more suitable.
Here four slopes are estimates: the slope in the control streams in the before period; the slope in the control streams in the after period using a broken stick model; the slope in the restoration stream in the before period, and the slope in the restoration stream in the after period.
The final result from a BACI analysis on the slopes is now the differential change in the slopes of \(log(CUE)\) in the usual BACI contrast.
As in the before/after analysis, the analysis can be done in two ways.
Again, if the design were exactly balanced, the two approaches would give exactly the same results.
The model is similar except the \(BA\) term in the BACI model of the previous section is replaced by two terms from the broken stick model using the variables \(Year\) and \(YearP\) (as in the section on the broken stick model).
The statistical model for each watershed is \[log(\mathit{CUE}+\textit{offset}) \sim \mathit{Year} +\mathit{YearP} + \mathit{CI} + \mathit{Year}:\mathit{CI} + \mathit{YearP}:\mathit{CI} + \mathit{YearF}(R) + \mathit{Stream}(R)+ \mathit{YearF}:\mathit{Stream}(R)\] where the \(\mathit{Year}\) refers to the trend in the control group in the before period; the \(\mathit{CI}\) refers to the control/impact effect; the \(\mathit{Year}:\mathit{CI}\) refers to the different slopes between control and impact in the before years; the \(\mathit{YearP}\) is the change in slope for the control group in the after period relative to the before period; and the \(\mathit{YearP}:CI\) is the BACI effect (the differential change in the slopes) which is the term of interest; the \(\mathit{YearF}\) refers to the year random effects (it must be declared as a factor to avoid being confused with a linear trend); the \(\mathit{Stream}\) term refers to the (random) differences among streams; and the \(\mathit{YearF}:\mathit{Stream}\) refers to the random effect where streams do not track each other in parallel over time.
The effects of stream will cancel in the analysis because each stream is measured before and after impact, i.e., each stream serves as a blocking variable. The effects of year will also cancel in the analysis because in each year, both control and impact streams are measured, i.e., each year also serves as a blocking variable. The last term represents inconsistency in the mean of a stream over time compared to other streams, i.e., a non-parallelism effect and is often the limiting factor in a power analysis in a BACI design.
The same model can be used if some streams are not measured in some years and if the design were unbalanced.
We exclude the watersheds that don’t have a control and treatment stream and a replicate of either the control or treatment streams.
n.streams <- plyr::ddply(cue.data, "Watershed", plyr::summarize,
n.control =length(unique(Stream[ CI=="C"])),
n.treatment=length(unique(Stream[ CI=="T"])))
n.streams
## Watershed n.control n.treatment
## 1 Ath/NSask 2 1
## 2 Clearwater 3 1
## 3 Ram 0 1
watershed.exclude <- n.streams[ n.streams$n.control==0 | n.streams$n.treatment==0,]
watershed.exclude
## Watershed n.control n.treatment
## 3 Ram 0 1
dim(cue.data)
## [1] 1577 11
cue.data.red <- cue.data[ !cue.data$Watershed %in% watershed.exclude$Watershed, ]
dim(cue.data.red)
## [1] 1454 11
The \(\mathit{lmerTest}\) package is used to fit the model; the \(\mathit{emmeans}\) package is used to estimate the BACI effect on the slopes after the model is fit. In this case, the \(\mathit{emtrends}\) function provides estimates of the slopes and changes in the slopes for the control and restoration streams. Because the coefficient of the \(YearP\) term is already the change in slopes between the \(after\) and \(before\) periods, we only need the pairwise difference of these coefficients to estimate the BACI effect on the slopes. This corresponds to the BACI contrast \[\mathit{BACI} = (\beta_{\mathit{After~C}} - \beta_{\mathit{Before~C}}) - (\beta_{\mathit{After~T}} - \beta_{\mathit{Before~T}})\] where the \(\beta\) terms refer to the SLOPES in the combination of BA and CI.
# Get all of the fits and store them in a list with the emmeans computatons
baci.slope.fits <- plyr::dlply(cue.data.red, c("Watershed","Species","AgeClass"), function(x){
#browser()
x$YearF <- as.factor(x$Year)
year.restore <- min(x$Year[x$BA=="After"])
x$YearP = pmax(0, x$Year-year.restore)
# If we use the log() in the fit, the emmeans package can do the back transform automatically for us
fit <- lmerTest::lmer(log(CUE+offset) ~ Year+YearP + CI + Year:CI + YearP:CI + (1|YearF) +(1|Stream) +(1|YearF:Stream), data=x)
# we create the emmeans object and then find the pairwise difference
fit.emmo <- emmeans::emtrends(fit, "CI", var="YearP") # This are the two changes in the slopes for C and I
BACI.diff.log <- summary(pairs(fit.emmo), infer=TRUE)
BACI.ratio <- NA # this doesn't make sense in this context
list(fit=fit,
BACI.diff.log=BACI.diff.log,
BACI.ratio=BACI.ratio)
})
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
## singular fit
# Show the results for the first fit
baci.slope.fits[[1]]$fit
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: log(CUE + offset) ~ Year + YearP + CI + Year:CI + YearP:CI + (1 | YearF) + (1 | Stream) + (1 | YearF:Stream)
## Data: x
## REML criterion at convergence: 646.7432
## Random effects:
## Groups Name Std.Dev.
## YearF:Stream (Intercept) 3.063e-01
## YearF (Intercept) 3.583e-01
## Stream (Intercept) 2.159e-06
## Residual 1.277e+00
## Number of obs: 188, groups: YearF:Stream, 25; YearF, 11; Stream, 3
## Fixed Effects:
## (Intercept) Year YearP CIT Year:CIT YearP:CIT
## 5.548e+01 -2.876e-02 -5.797e-03 -3.350e+02 1.663e-01 1.010e-01
## convergence code 0; 1 optimizer warnings; 0 lme4 warnings
baci.slope.fits[[1]]$BACI.diff.log
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## C - T -0.101 0.284 11.8 -0.721 0.519 -0.356 0.7283
##
## Confidence level used: 0.95
A message about a singular fit indicates that the estimated standard deviation of some of the random effects is 0. The estimated BACI effect on the slopes (i.e. the differential change in the slopes) along with standard errors, a 95% confidence limit, and p-value of the test for no effect is produced.
We can examine the plots of fitted and predicted values to check the general form of the fit and again for any obvious problems:
# Get the fitted values
baci.slope.pred <- plyr::ldply(baci.slope.fits, function(x){
#browser()
# need to convert the YearF from factor back to numeric value
pred <- cbind( x$fit@frame, pred.global=predict(x$fit, re.form=~0))
pred$pred.stream =predict(x$fit, re.form=~(1|Stream))
pred$Year <- as.numeric(as.character(pred$YearF))
pred
})
# Now to generate the observed and fitted data for each Watershed/Stream combination
npages=2 # only 2 watersheds present
plyr::l_ply(1:npages, function(page){
plot.fit <- ggplot(data=cue.data.red, aes(x=Year, y=logCUE))+
ggtitle("Observed and fitted data from BACI on slopes in each watershed")+
#geom_point(position=position_jitter(width=0.1), size=.1)+ too buzy
geom_line(data=baci.slope.pred, aes(y=pred.global, color=CI), size=2)+
geom_line(data=mean.CUE[ !mean.CUE$Watershed %in% watershed.exclude$Watershed,], aes(y=mean.logCUE, linetype=Stream))+
geom_vline(data=year.restore[ !year.restore$Watershed %in% watershed.exclude$Watershed,], aes(xintercept=year.restore+.5))+
facet_grid_paginate(Watershed+Species~AgeClass, page=page, nrow=3, ncol=3)
plot(plot.fit)
})
This set of fits and estimated BACI effects can be used to construct summary plots
# Get all of the fits and store them in a list with the emmeans computatons
baci.slope.effs <- plyr::ldply(baci.slope.fits, function(x){x$BACI.diff.log})
#head(baci.effs)
#pretty up the output
temp<- baci.slope.effs
num.col <- sapply(temp, is.numeric)
temp[,num.col] <- round(temp[,num.col],2)
temp$t.ratio <- NULL
temp
## Watershed Species AgeClass contrast estimate SE df lower.CL upper.CL p.value
## 1 Ath/NSask BKTR AllAges C - T -0.10 0.28 11.84 -0.72 0.52 0.73
## 2 Ath/NSask BLTR Immature C - T -0.35 0.26 10.80 -0.92 0.23 0.21
## 3 Ath/NSask BLTR Mature C - T -0.59 0.28 10.79 -1.22 0.03 0.06
## 4 Ath/NSask RNTR Immature C - T -0.04 0.31 7.12 -0.77 0.70 0.91
## 5 Ath/NSask RNTR Mature C - T 0.06 0.27 7.12 -0.58 0.69 0.84
## 6 Clearwater BKTR AllAges C - T -0.08 0.24 9.41 -0.61 0.46 0.76
## 7 Clearwater BLTR Immature C - T -0.36 0.29 7.71 -1.04 0.32 0.26
## 8 Clearwater BLTR Mature C - T -0.35 0.29 8.72 -1.00 0.30 0.25
ggplot(data=baci.slope.effs, aes(y=Watershed, x=estimate))+
ggtitle("Estimated BACI effects on slope (log scale)",
subtitle="BACI computed as (AC-BC)-(AT-BT)")+
geom_point()+
ggplot2::geom_errorbarh( aes(xmin=lower.CL, xmax=upper.CL), height=.01)+
xlab("Estimated BACI effect (log-scale) and 95% ci and p-value")+
ylab("Watershed")+
geom_vline(xintercept=0)+
geom_text( label=format.p(baci.slope.effs$p.value, precision=.001), aes(x=-Inf), hjust=-.01, vjust=1.5)+
facet_grid(Species~AgeClass)
Case where the 95% confidence interval does not include the value of 0 (vertical line) indicate evidence of a BACI effect. The actual p-value is also shown on the plot
We again sample from the posterior distribution of the parameters using the arm package. Some care is needed to ensure that the BACI comparison is computed in the correct direction. Notice here that we must take the negative of the BACI interaction term.
# sample from the posterior for the BACI effect, plot the violin plots and compute the posterior belief that BACI effect >0
# Again, need to be careful that the sign of the contrast for the BACI effect on slopes is correct.
baci.slope.post <- plyr::ldply(baci.slope.fits, function(x){
#browser()
post.samples <- arm::sim(x$fit, n.sims=n.sims.post)
data.frame(post.baci = -post.samples@fixef[,"YearP:CIT"]) # notice that R computes before-after
})
head(baci.slope.post)
## Watershed Species AgeClass post.baci
## 1 Ath/NSask BKTR AllAges -0.58593036
## 2 Ath/NSask BKTR AllAges -0.38016211
## 3 Ath/NSask BKTR AllAges -0.31455948
## 4 Ath/NSask BKTR AllAges -0.16622564
## 5 Ath/NSask BKTR AllAges -0.08253469
## 6 Ath/NSask BKTR AllAges -0.28354542
# compute the posterior belief that difference (A-B) > 0
baci.slope.belief <- plyr::ddply(baci.slope.post, c("Watershed","Species","AgeClass"),
plyr::summarize,
post.belief.gt.0 = mean(post.baci >0))
head(baci.slope.belief)
## Watershed Species AgeClass post.belief.gt.0
## 1 Ath/NSask BKTR AllAges 0.344
## 2 Ath/NSask BLTR Immature 0.089
## 3 Ath/NSask BLTR Mature 0.010
## 4 Ath/NSask RNTR Immature 0.464
## 5 Ath/NSask RNTR Mature 0.569
## 6 Clearwater BKTR AllAges 0.363
ggplot(data=baci.slope.post, aes(y=Watershed, x=post.baci))+
ggtitle("Posterior sample for BACI effect on slopes (log scale)",
subtitle="BACI computed as (AC-BC)-(AT-BT)")+
ggstance::geom_violinh()+
geom_text(data=baci.slope.belief, label=formatC(baci.slope.belief$post.belief.gt.0,2, format="f"),
aes(x=-Inf), hjust=-.2, vjust=1.5) +
xlab("BACI effect on slopes (log-scale) \nWith posterior belief that difference > 0")+
ylab("Watershed ")+
geom_vline(xintercept=0)+
facet_grid(Species~AgeClass)
It is always important to do a model assessment to ensure that the assumptions of the model are satisfied. In this analysis the key assumptions are
It is possible to extract the estimates of the sets of random effects and create residual and caterpillar plots.
# The dianostic plot is produced for the first fit as a demonstration
plyr::l_ply(baci.slope.fits[1], function(x){
diag.plot <- sf.autoplot.lmer(x$fit)
plot(diag.plot)
})
The residual plot (upper left corner) should show random scatter around 0. In this plot, we notice what appears to be a “line” of residuals. This is an artefact of the data as noted in the previous section.
The normal probability plot (upper right) for the residuals (the site-effects) should fall close to the theoretical line of fit. It is not crucial that these lowest random effects have an exact normal distribution since they are effectively averaged in the analysis and the central-limit theorem indicates that the average of any distribution will tend towards normality as the sample size increases.
The caterpillar plots should show no large outlier effects for specific years. In this case, the estimated standard deviation of the random effects is 0, so all of the points (with 95% credible interval) are the same at 0.
It will be extremely difficult to detect autocorrelation over time unless you have 10+ years of data (Bence 1995) and so this is not done here.
The impact of restoration may not be equal for all watersheds, i.e. variation may also exist in the BACI effect. In this section, we will fit a super-model for all streams that estimates the average effect of restoration across watersheds, but also how variable this effect is across watersheds.
We once again use a mixed linear model to account for the various sources of variation in the model.
The statistical model for each watershed is \[log(\mathit{CUE}+\textit{offset}) \sim \mathit{BA} + \mathit{CI} + \mathit{BA}:\mathit{CI} + \mathit{Watershed}:\mathit{CI}(R) +\mathit{Watershed}:\mathit{BA}(R)+ \mathit{Watershed}:\mathit{CI}:\mathit{BA})+ \mathit{YearF}(R) + \mathit{Stream}(R)+ \mathit{YearF}:\mathit{Stream}(R)\] where the new random effect involving the Watershed allow for variation in these effects across the watershed.
It is not necessary to have a random effect for Watershed because we have random effects for each individual stream.
It is not necessary that the restoration year is the same in all watershed as long as the data file indicates which year belong to which period for each watershed.
We implicitly assume that year-specific effects apply similarly to all watersheds. However, watershed differences in the year-effect can be assessed by adding a watershed-year interaction and looking at the size of the resulting variance component.
It would make no sense to combine results over watersheds where only one watershed is involved for a species-age class. For example, RNTR is only observed in one watershed.
Similarly, we need to exclude watersheds where no BACI design was done.
# exclude species that only appear in 1 watershed
n.watersheds <- plyr::ddply(cue.data, c("Species"), plyr::summarize,
n.watersheds=length(unique(Watershed)))
n.watersheds
## Species n.watersheds
## 1 BKTR 3
## 2 BLTR 3
## 3 RNTR 1
species.exclude <- n.watersheds[ n.watersheds$n.watersheds==1,]
species.exclude
## Species n.watersheds
## 3 RNTR 1
dim(cue.data)
## [1] 1577 11
cue.data.red <- cue.data[ !cue.data$Species %in% species.exclude$Species, ]
dim(cue.data.red)
## [1] 1269 11
# exclude watersheds that don't have a BACI design
n.streams <- plyr::ddply(cue.data.red, "Watershed", plyr::summarize,
n.control =length(unique(Stream[ CI=="C"])),
n.treatment=length(unique(Stream[ CI=="T"])))
n.streams
## Watershed n.control n.treatment
## 1 Ath/NSask 2 1
## 2 Clearwater 3 1
## 3 Ram 0 1
watershed.exclude <- n.streams[ n.streams$n.control==0 | n.streams$n.treatment==0,]
watershed.exclude
## Watershed n.control n.treatment
## 3 Ram 0 1
dim(cue.data.red)
## [1] 1269 11
cue.data.red <- cue.data.red[ !cue.data.red$Watershed %in% watershed.exclude$Watershed, ]
dim(cue.data.red)
## [1] 1146 11
The \(\mathit{lmerTest}\) package is used to fit the model; the \(\mathit{emmeans}\) package is used to estimate the BACI effect after the model is fit. Some care is needed to specify the coefficients of the BACI contrast properly. For example, in this model, the interaction (BACI) term is specified as \(\mathit{BA}:\mathit{CI}\). The \(\mathit{BA}\) variable is coded using After and Before; the \(\mathit{CI}\) variable is coded using C and I. Consequently, the BACI terms are in the order After C, Before C,After T, and Before T.
We would like the BACI contrast to be computed as the difference of difference in the mean as \[\mathit{BACI} = (\mu_{\mathit{After~T}} - \mu_{\mathit{Before~T}}) - (\mu_{\mathit{After~C}} - \mu_{\mathit{Before~C}})\] and so the BACI coefficients are \((-1,1,1,-1)\) as specified below.
# Get all of the fits for the global BACI model and store them in a list with the emmeans computatons
baci.g.fits <- plyr::dlply(cue.data.red, c("Species","AgeClass"), function(x){
#browser()
x$YearF <- factor(x$Year)
# If we use the log() in the fit, the emmeans package can do the back transform automatically for us
fit <- lmerTest::lmer(log(CUE+offset) ~ BA + CI + BA:CI +
(1|Watershed:CI)+(1|Watershed:BA)+(1|Watershed:CI:BA)+
(1|YearF) +(1|Stream) +(1|YearF:Stream), data=x)
# we create the emmeans object and then find the pairwise difference
fit.emmo <- emmeans::emmeans(fit, ~BA:CI)
BACI.diff.log <- summary(contrast(fit.emmo,list(baci=c(-1,1,1,-1))), infer=TRUE)
BACI.ratio <- summary(contrast(fit.emmo,list(baci=c(-1,1,1,-1)), type="response"))
list(fit=fit,
BACI.diff.log=BACI.diff.log,
BACI.ratio=BACI.ratio)
})
## singular fit
## singular fit
## singular fit
# Show the results for one of the fits
baci.fits[[1]]$fit
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: log(CUE + offset) ~ BA + CI + BA:CI + (1 | YearF) + (1 | Stream) + (1 | YearF:Stream)
## Data: x
## REML criterion at convergence: 628.5947
## Random effects:
## Groups Name Std.Dev.
## YearF:Stream (Intercept) 0.00000
## YearF (Intercept) 0.29185
## Stream (Intercept) 0.09358
## Residual 1.26406
## Number of obs: 188, groups: YearF:Stream, 25; YearF, 11; Stream, 3
## Fixed Effects:
## (Intercept) BABefore CIT BABefore:CIT
## -2.65555 0.09172 1.40273 -1.50980
## convergence code 0; 1 optimizer warnings; 0 lme4 warnings
baci.fits[[1]]$BACI.diff.log
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## baci 1.51 0.422 6.77 0.505 2.51 3.578 0.0095
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
baci.fits[[1]]$BACI.ratio
## contrast ratio SE df t.ratio p.value
## baci 4.53 1.91 6.77 3.578 0.0095
##
## Tests are performed on the log scale
A message about a singular fit indicates that the estimated standard deviation of some of the random effects is 0. The estimated BACI effect along with standard errors, a 95% confidence limit, and p-value of the test for no effect is produced.
We can examine the plots of fitted and predicted values to check the general form of the fit and again for any obvious problems:
# Get the fitted values
baci.g.pred <- plyr::ldply(baci.g.fits, function(x){
#browser()
# need to convert the YearF from factor back to numeric value
pred <- cbind( x$fit@frame, pred.global=predict(x$fit, re.form=~0))
pred$pred.watershed =predict(x$fit, re.form=~(1|Watershed:CI))
pred$pred.stream =predict(x$fit, re.form=~(1|Stream))
pred$Year <- as.numeric(as.character(pred$YearF))
pred
})
# Now to generate the observed and fitted data for each Watershed/Stream combination
npages=1 # only 1 page is needed
plyr::l_ply(1:npages, function(page){
plot.fit <- ggplot(data=cue.data.red, aes(x=Year, y=logCUE))+
ggtitle("Observed and fitted data from global BACI")+
#geom_point(position=position_jitter(width=0.1), size=.1)+ too buzy
geom_line(data=baci.g.pred, aes(y=pred.global, color=CI), size=2)+
geom_line(data=mean.CUE[ !mean.CUE$Watershed %in% watershed.exclude$Watershed,], aes(y=mean.logCUE, linetype=Stream))+
geom_vline(data=year.restore[ !year.restore$Watershed %in% watershed.exclude$Watershed,], aes(xintercept=year.restore+.5))+
facet_grid_paginate(Species~AgeClass, page=page, nrow=3, ncol=3)
plot(plot.fit)
})
This set of fits and estimated BACI effects can be used to construct summary plots
# Get all of the fits and store them in a list with the emmeans computatons
baci.g.effs <- plyr::ldply(baci.g.fits, function(x){x$BACI.diff.log})
#head(baci.effs)
#pretty up the output
temp<- baci.g.effs
num.col <- sapply(temp, is.numeric)
temp[,num.col] <- round(temp[,num.col],2)
temp$t.ratio <- NULL
temp
## Species AgeClass contrast estimate SE df lower.CL upper.CL p.value
## 1 BKTR AllAges baci 1.89 0.44 0.95 -4.32 8.09 0.15
## 2 BLTR Immature baci 0.91 0.38 0.86 -6.20 8.02 0.28
## 3 BLTR Mature baci 0.35 0.45 1.07 -4.55 5.25 0.57
ggplot(data=baci.g.effs, aes(y="GLOBAL", x=estimate))+
ggtitle("Estimated GLOBAL BACI effects (log scale)",
subtitle="BACI computed as (AT-BT)-(AC-BC)")+
geom_point()+
ggplot2::geom_errorbarh( aes(xmin=lower.CL, xmax=upper.CL), height=.01)+
xlab("Estimated BACI effect (log-scale) and 95% ci and p-value")+
ylab("Watershed")+
geom_vline(xintercept=0)+
geom_text( label=format.p(baci.g.effs$p.value, precision=.001), aes(x=-Inf), hjust=-.01, vjust=1.5)+
facet_grid(Species~AgeClass)
A similar plot can be made for the BACI ratio (not shown).
Case where the 95% confidence interval does not include the value of 0 (vertical line) indicate evidence of a GLOBAL BACI effect. The actual p-value is also shown on the plot.
The BACI effect may vary among the watersheds. This is measured by the standard deviation of the BACI by watershed interaction:
# Extract the BACI x watershed standard deviation
baci.g.w.baci.sd <- plyr::ldply(baci.g.fits, function(x){
#browser()
vc <- as.data.frame(VarCorr(x$fit))
data.frame(w.baci.sd = vc[ vc$grp=="Watershed:CI:BA", "sdcor"])
})
baci.g.w.baci.sd
## Species AgeClass w.baci.sd
## 1 BKTR AllAges 2.347422e-01
## 2 BLTR Immature 1.533664e-07
## 3 BLTR Mature 0.000000e+00
This provides information on the variation in the BACI effect among the watersheds.
We again sample from the posterior distribution of the parameters using the arm package. Some care is needed to ensure that the BACI comparison is computed in the correct direction. Notice here that we must take the negative of the BACI interaction term.
# sample from the posterior for the BACI effect, plot the violin plots and compute the posterior belief that BACI effect >0
# Again, need to be careful that the sign of the contrast for the BACI effect is correct.
baci.g.post <- plyr::ldply(baci.g.fits, function(x){
#browser()
post.samples <- arm::sim(x$fit, n.sims=n.sims.post)
data.frame(post.baci = -post.samples@fixef[,"BABefore:CIT"]) # notice that R computes before-after
})
head(baci.g.post)
## Species AgeClass post.baci
## 1 BKTR AllAges 2.440653
## 2 BKTR AllAges 2.129606
## 3 BKTR AllAges 2.331481
## 4 BKTR AllAges 1.742250
## 5 BKTR AllAges 1.137542
## 6 BKTR AllAges 2.140573
# compute the posterior belief that difference (A-B) > 0
baci.g.belief <- plyr::ddply(baci.g.post, c("Species","AgeClass"),
plyr::summarize,
post.belief.gt.0 = mean(post.baci >0))
head(baci.g.belief)
## Species AgeClass post.belief.gt.0
## 1 BKTR AllAges 1.000
## 2 BLTR Immature 0.993
## 3 BLTR Mature 0.795
ggplot(data=baci.g.post, aes(y="GLOBAL", x=post.baci))+
ggtitle("Posterior sample for GLOBAL BACI effect (log scale)",
subtitle="BACI computed as (AT-BT)-(AC-BC)")+
ggstance::geom_violinh()+
geom_text(data=baci.g.belief, label=formatC(baci.g.belief$post.belief.gt.0,2, format="f"),
aes(x=-Inf), hjust=-.2, vjust=1.5) +
xlab("BACI effect (log-scale) \nWith posterior belief that difference > 0")+
ylab("Watershed ")+
geom_vline(xintercept=0)+
facet_grid(Species~AgeClass)
It is possible to extract the estimates of the sets of random effects and create residual and caterpillar plots in the usual way.
# The dianostic plot is produced for the one of the fits as a demonstration
plyr::l_ply(baci.g.fits[1], function(x){
diag.plot <- sf.autoplot.lmer(x$fit)
plot(diag.plot)
})
The residual plots are interpreted in the same way as previously. Again, the “line” in the residuals is an artefact of CUE being limited below by 0 and the offset. As noted previously, the distribution of the residuals at the lowest level is not that crucial because measurements at sites are essentially being averaged for each year.
A key point when analyzing the future data is to account for stream-years-sites where no fish of a particular species was found, i.e. 0 counts must be imputed. This is a common problem with many database systems that only record positive counts. An additional table is needed in the database of the set of sites visited in each stream-year and the species to be expected, so that 0’s can be properly imputed. The FWMIS system has been programmed to do the 0 imputations, so no further imputation will be needed.
This demonstration shows several ways to analyse the upcoming data to be collected from the restoration program. The analysis become progressively more complex so deficiencies within each step are addressed.
The before-after analysis estimate of effect size for each stream will be confounded with the effects of any other temporal variable. For example, suppose that in the five years after restoration takes place, that the precipitation levels are much lower than normal leading to degradation of water quality and a reduction in CUE. This will be confounded with the difference in mean response between the after and before periods.
The before-after analysis also assumes that the change due to restoration is an immediate step function. In some cases, improvements after restoration will be gradual and cumulative as, for example, habitat slowly improves. In this case, a broken-stick model looks for a change in the trend starting at the time of restoration. In this case, the time of the change in trend is known. It is a much more difficult problem to fit a model where the break-point is to be estimated as well. This is known as the change-point problem (Toms and Lesperance, 2003) and will require many, many, many more years of data before it is feasible (typically at least 30-40 years of data!).
The BACI design uses stream(s) that have not been restored as controls to account for temporal change. In this study, multiple control streams provide assurances that the observed effects are not specific to a particular control stream but can be generalized across the set of control streams. Only having a single treatment stream, unfortunately, still leaves some question that the results are particular to this specific treatment stream, but the variability among control streams provides some (limited) information on the variation that would be expected if multiple treatment streams were used.
Finally, a global BACI analysis combined results over watersheds. Here the BACI effect is a “global” average effect. Because the restoration actions are likely quite different across watersheds, this average BACI effect must be interpreted with care.
This demonstration shows the results from a classical analysis using linear mixed models fit with REML, the standard analysis that can account for multiple sources of variation. In particular, any analysis that involves measurements over time must account for year-specific effects (process error) which, if ignored, would violate the assumption of independence among observations. In this particular case, sites within a stream-year are pseudo-replicates and cannot be analyzed directly.
A common problem with the results from Neyman-Pearson approaches (p-values and confidence intervals), is that these statistics do not have an intuitive interpretation. For example, many people will interpret a p-value of 0.01 as being the probability that the hypothesis of no effect is true. This is an erroneous interpretation. Similarly, many people interpret confidence intervals as a zone of belief of the underlying parameter value. This is again erroneous. In both cases, a Bayesian perspective is helpful, in particular, the samples from the posterior distribution of a parameter can be interpreted as the “belief” in particular values of the parameters. So it is sensible to compute the posterior belief that the effect is positive. The posterior belief can be interpreted as a measure of the weight of evidence in favor of the hypothesis. Fortunately, the arm package in \(R\) makes it particularly easy to sample from the posterior distribution of parameters in mixed linear models and so easy to compute these beliefs.
AEP (Alberta Environment and Parks). 2018. Electrofishing standard for sampling of rivers in Alberta. Fisheries Management, Policy and Operations Branch, AEP. April 2018. 24 p.
Bence, J. R. 1995. “Analysis of Short Time Series: Correction for Autocorrelation.” Ecology 76: 628–39.
Kuznetsova A, Brockhoff PB, Christensen RHB (2017). “lmerTest Package: Tests in Linear Mixed Effects Models.” Journal of Statistical Software, 82, 1-26. doi: 10.18637/jss.v082.i13 (URL: http://doi.org/10.18637/jss.v082.i13).
Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Gelman, A. and Su, Yu-Sung (2018). arm: Data Analysis Using Regression and Multilevel/Hierarchical Models. R package version 1.10-1. https://CRAN.R-project.org/package=arm
Lenth, R. (2019). emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.3.2. https://CRAN.R-project.org/package=emmeans
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Schwarz. C. J. (2018). Power analysis for evaluating the effectiveness of intensive OHV reclamation activities to recover native trout along the East Slopes of Alberta, Canada. Prepared for the Alberta Ministry of Environment. 53pp.
Schwarz, C. J. (2017). Monitoring Design and Analysis using the Fish Sustainability Index (FSI). Prepared for the Alberta Ministry of Environment. 16 pp + Extensive R code.
Toms, J. D., and M. L. Lesperance. 2003. “Piecewise Regression A Tool for Identifying Ecological Thresholds.” Ecology 84 (8): 2034–41.